Abstract

This paper uses the notion of algorithmic stability to derive novel generalization bounds for several 
families of transductive regression algorithms, both by using convexity and closed-form solutions. 
Our analysis helps compare the stability of these algorithms. It also shows that a number of 
widely used transductive regression algorithms are in fact unstable. Finally, it reports the results 
of experiments with local transductive regression demonstrating the benefit of our stability bounds 
for model selection, for one of the algorithms, in particular for determining the radius of the local 
neighborhood used by the algorithm. 
1. Introduction

The problem of transductive inference was originally introduced by Vapnik (1982). Many learning 
problems in information extraction, computational biology, natural language processing and other 
domains can be formulated as a transductive inference problem. In the transductive setting, the 
learning algorithm receives both a labeled training set, as in the standard induction setting, and a set 
of unlabeled test points. The objective is to predict the labels of the test points. No other test points 
will ever be considered. This setting arises in a variety of applications. Often, there are orders of 
magnitude more unlabeled points than labeled ones and they have not been assigned a label due to 
the prohibitive cost of labeling. This motivates the use of transductive algorithms which leverage 
the unlabeled data during training to improve learning performance. 

This paper deals with transductive regression, which arises in problems such as predicting the 
real-valued labels of the nodes of a fixed (known) graph in computational biology, or the scores as- 
sociated with known documents in information extraction or search engine tasks. Several algorithms 
have been devised for the specific setting of transductive regression (Belkin et al., 2004b; Chapelle 
et al., 1999; Schuurmans and Southey, 2002; Cortes and Mohri, 2007). Several other algorithms 
introduced for transductive classification can be viewed in fact as transductive regression ones as 
their objective function is based on the square loss, for example, in Belkin et al. (2004a,b). Cortes 
and Mohri (2007) gave explicit VC-dimension generalization bounds for transductive regression 
that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik 
(1998) when applied to classification. 

We present novel algorithm-dependent generalization bounds for transductive regression. Since 
they are algorithm-specific, these bounds can often be tighter than bounds based on general
complexity measures such as the VC-dimension. Our analysis is based on the notion of algorithmic
stability and our learning bounds generalize to the transduction scenario the stability bounds given 
by Bousquet and Elisseeff (2002) for the inductive setting and extend to regression the stability- 
based transductive classification bounds of El-Yaniv and Pechyony (2006). 

In Section 2 we give a formal definition of the transductive inference learning set-up, including 
a precise description and discussion of two related transductive settings. We also introduce the 
notions of cost and score stability used in the following sections. 

Standard concentration bounds such as McDiarmid's bound (McDiarmid, 1989) cannot be read- 
ily applied to the transductive regression setting since the points are not drawn independently but 
uniformly without replacement from a finite set. Instead, Section 3.1 proves a concentration bound 
generalizing McDiarmid's bound to the case of random variables sampled without replacement. 
This bound is slightly stronger than that of El-Yaniv and Pechyony (2006, 2007) and the proof 
much simpler and more concise. This concentration bound is used to derive a general transductive 
regression stability bound in Section 3.2. Figure 1 shows the outline of the paper. 

Section 4 introduces and examines a very general family of transductive algorithms, that of local
transductive regression (LTR) algorithms, a generalization of the algorithm of Cortes and Mohri 
(2007). It gives general bounds for the stability coefficients of LTR algorithms and uses them to 
derive stability-based learning bounds for these algorithms. The stability analysis in this section is 
based on the notion of cost stability and on convexity arguments.

In Section 5, we analyze a general class of unconstrained optimization algorithms that includes 
a number of recent algorithms (Wu and Scholkopf, 2007; Zhou et al., 2004; Zhu et al., 2003). The 
optimization problems for these algorithms admit a closed-form solution. We use that to give a 
score-based stability analysis of these algorithms. Our analysis shows that in general these algo- 
rithms may not be stable. In fact, in Section 5.4 we prove a lower bound on the stability coefficient 
of these algorithms under some assumptions. 

Section 6 examines a class of constrained regularization optimization algorithms for graphs that 
enjoy better stability properties than the unconstrained ones just mentioned. This includes the graph 
Laplacian algorithm of Belkin et al. (2004a). In Section 6.2, we give a score stability analysis with 
novel generalization bounds for this algorithm, simpler and more general than those given by Belkin 
et al. (2004a). Section 6.3 shows that algorithms based on constrained graph regularizations are in 
fact special instances of the LTR algorithms by showing that the regularization term can be written 
in terms of a norm in a reproducing kernel Hilbert space. This is used to derive a cost stability 
analysis and novel learning bounds for the graph Laplacian algorithm of Belkin et al. (2004a) in 
terms of the second smallest eigenvalue of the Laplacian and the diameter of the graph. Many of
the results of these sections generalize to other constrained regularization optimization algorithms.
These generalizations are briefly discussed in Section 6.4 where it is indicated, in particular, how
similar constraints can be imposed on the algorithms of Wu and Scholkopf (2007); Zhou et al.
(2004); Zhu et al. (2003) to derive new and stable versions of these algorithms.

Finally, Section 7 shows the results of experiments with local transductive regression demon- 
strating the benefit of our stability bounds for model selection, in particular for determining the 
radius of the local neighborhood used by the algorithm, which provides a partial validation of our 
bounds and analysis. 
2. Definitions

Let $X$ denote the input space and $Y$ a measurable subset of $\mathbb{R}$.

2.1 Transductive learning set-up 

In transductive learning settings, the algorithm receives a labeled training set $S$ of size $m$,
$S = ((x_1, y_1), \ldots, (x_m, y_m))$ with $(x_k, y_k) \in X \times Y$, and an unlabeled test set $T$
of size $u$, $x_{m+1}, \ldots, x_{m+u} \in X$. The transductive learning problem consists of accurately
predicting the labels $y_{m+1}, \ldots, y_{m+u}$ of the test examples; no other test example is ever
considered. Two different settings can be distinguished to formalize this problem, see (Vapnik, 1998).

Setting 1 In this setting, a full sample X of m + u examples is given. The learning algorithm 
further receives the labels of a training sample S of size m selected from X uniformly at random 
without replacement. The remaining u unlabeled examples serve as a test sample T. We denote by 
X = (S, T) a partitioning of X into a training set S and test set T. 

Setting 2 Here, the training sample S and test sample T are both drawn i.i.d. according to some 
distribution D. The labeled sample S and the test points T, without their labels, are made available 
to the learning algorithm. 

As in previous theoretical studies of the transduction problem, e.g., (Vapnik, 1998; Derbeko 
et al., 2004; Cortes and Mohri, 2007; El-Yaniv and Pechyony, 2006), we analyze setting 1 and 
derive generalization bounds for this specific setting. However, as pointed out by Vapnik (1998), 
any generalization bound in the setting we analyze directly yields a bound for setting 2 by taking 
the expectation. 

The specific problem where the labels are real-valued numbers, as in the case studied in this 
paper, is that of transductive regression. It differs from the standard inductive regression since the 
learning algorithm is given the unlabeled test examples beforehand and can thus possibly exploit
that information to improve its performance. 

2.2 Notions of stability 

We denote by $c(h, x)$ the cost of an error of a hypothesis $h$ on a point $x$ labeled with $y(x)$. The cost
function commonly used in regression is the square loss $c(h, x) = [h(x) - y(x)]^2$. We shall assume
a square loss for the remainder of this paper, but many of our results generalize to other convex cost
functions. The training error $\widehat{R}(h)$ and test error $R(h)$ of a hypothesis $h$ are defined as follows:

$$\widehat{R}(h) = \frac{1}{m} \sum_{k=1}^{m} c(h, x_k), \qquad R(h) = \frac{1}{u} \sum_{k=1}^{u} c(h, x_{m+k}). \tag{1}$$

The generalization bounds we derive are based on the notion of algorithmic stability. We shall use
the following two notions of stability in our analysis.

Definition 1 (Cost stability) Let $L$ be a transductive learning algorithm and let $h$ denote the
hypothesis returned by $L$ for $X = (S, T)$ and $h'$ the hypothesis returned for $X = (S', T')$, where $S$
and $S'$ differ in exactly one point. $L$ is said to be uniformly $\beta$-stable with respect to the cost function
$c$ if there exists $\beta \geq 0$ such that for all $x \in X$,

$$|c(h', x) - c(h, x)| \leq \beta. \tag{2}$$
Figure 1: A high-level outline of the paper.



Definition 2 (Score stability) Let $L$ be a transductive learning algorithm and let $h$ denote the
hypothesis returned by $L$ for $X = (S, T)$ and $h'$ the hypothesis returned for $X = (S', T')$. $L$ is said to
be uniformly $\beta$-stable with respect to its output scores if there exists $\beta \geq 0$ such that for all $x \in X$,

$$|h'(x) - h(x)| \leq \beta. \tag{3}$$

We will say that a hypothesis set $H$ is bounded by $B > 0$ when $|h(x) - y(x)| \leq B$ for all $x \in X$
and $h \in H$. For such a hypothesis set and the square loss, for any two hypotheses $h, h' \in H$ and
$x \in X$, the following inequality holds:

$$|c(h', x) - c(h, x)| = \left| [h'(x) - y(x)]^2 - [h(x) - y(x)]^2 \right| \tag{4}$$
$$= |h'(x) - h(x)| \, |h'(x) - y(x) + h(x) - y(x)| \tag{5}$$
$$\leq 2B \, |h'(x) - h(x)|. \tag{6}$$

Thus, for $H$ bounded by $B$ and the square loss, $\beta$-score-stability implies $2B\beta$-cost-stability.
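The implication from score stability to cost stability can be checked numerically on a single point. The following is a minimal sketch; the function names and the numeric values are hypothetical and chosen only so that both predictions lie within $B$ of the target.

```python
def square_loss(pred: float, target: float) -> float:
    """Square loss c(h, x) = (h(x) - y(x))^2."""
    return (pred - target) ** 2


def cost_gap_bound(pred: float, pred_prime: float, B: float) -> float:
    """Right-hand side of inequality (6): 2 * B * |h'(x) - h(x)|."""
    return 2.0 * B * abs(pred_prime - pred)


# Hypothetical values: target y(x) and two predictions h(x), h'(x),
# both within B = 0.5 of the target.
y, h_x, h_prime_x, B = 1.0, 1.4, 0.7, 0.5

# Left-hand side of (4): change in cost between the two hypotheses.
gap = abs(square_loss(h_prime_x, y) - square_loss(h_x, y))
bound = cost_gap_bound(h_x, h_prime_x, B)
```

Here the change in cost is dominated by `bound`, as inequalities (4) to (6) predict whenever both hypotheses belong to a set bounded by `B`.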

For the remainder of this paper, unless otherwise specified, stability is meant as cost-based
stability.

3. General transduction stability bounds 

Stability-based generalization bounds in the inductive setting are derived using McDiarmid's
inequality (McDiarmid, 1989). The main technique used is to show that under suitable conditions on
the stability of the algorithm, the difference of the test error and the training error is sharply
concentrated around its expected value, and that this expected value itself is small. Roughly speaking,
this implies that with high probability, the test error is close to the training error. Since the points in
the training and test sample are drawn in an i.i.d. fashion, McDiarmid's inequality can be applied.

However, in the transductive setting, the sampling random variables are not drawn indepen- 
dently. Thus, McDiarmid's concentration bound cannot be readily used in this case. Instead, a 
generalization of McDiarmid's bound that holds for random variables sampled without replacement 
is needed. We present such a generalization in this section with a concise proof. A slightly weaker 
version of this bound with a somewhat more complex proof was derived by El-Yaniv and Pechyony 
(2006, 2007). 



3.1 Concentration bound for sampling without replacement 



To derive this concentration bound, we use the method of averaged bounded differences and the
following theorem due to Azuma (1967) and McDiarmid (1989), where we denote by $S_i^j$ the
subsequence of random variables $S_i, \ldots, S_j$ and write $S_i^j = x_i^j$ as a shorthand for the event
$S_i = x_i, \ldots, S_j = x_j$.

Theorem 3 (McDiarmid (1989), Th. 6.10) Let $S_1^m$ be a sequence of random variables with each
$S_i$ taking values in $X$. Let $\phi : X^m \to \mathbb{R}$ be a measurable function satisfying the following
conditions:

$$\forall i \in [1, m], \; \forall x_i, x_i' \in X, \quad \left| E_{S_{i+1}^m}\!\left[\phi \mid S_1^{i-1}, S_i = x_i\right] - E_{S_{i+1}^m}\!\left[\phi \mid S_1^{i-1}, S_i = x_i'\right] \right| \leq c_i. \tag{7}$$

Then, for all $\epsilon > 0$,

$$\Pr\left[\phi - E[\phi] \geq \epsilon\right] \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\right).$$

The following definition is needed for the presentation of our concentration bound.



Definition 4 (Symmetric Functions) A function $\phi : X^m \to \mathbb{R}$ is said to be symmetric if its value
does not depend on the order of its arguments, that is for any two permutations $\sigma$ and $\sigma'$ over $[1, m]$
and any $m$ points $x_1, \ldots, x_m \in X$, $\phi(x_{\sigma(1)}, \ldots, x_{\sigma(m)}) = \phi(x_{\sigma'(1)}, \ldots, x_{\sigma'(m)})$.



Theorem 5 (Concentration bound for sampling without replacement) Let $S_1^m$ be a sequence of
random variables, sampled uniformly without replacement from a fixed set $X$ of $m + u$ elements,
and let $\phi : X^m \to \mathbb{R}$ be a symmetric function such that for all $i \in [1, m]$ and for all
$x_1, \ldots, x_m \in X$ and $x_1', \ldots, x_m' \in X$,

$$\left| \phi(x_1, \ldots, x_m) - \phi(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_m) \right| \leq c.$$

Then, for all $\epsilon > 0$,

$$\Pr\left[\phi - E[\phi] \geq \epsilon\right] \leq \exp\left(\frac{-2\epsilon^2}{\alpha(m, u)\, c^2}\right), \tag{8}$$

where $\alpha(m, u) = \dfrac{mu}{m + u - 1/2} \cdot \dfrac{1}{1 - 1/(2\max\{m, u\})}$.
Proof Fix $i \in [1, m]$ and define $g(S_1^{i-1}, x_i, x_i')$ as follows:

$$g(S_1^{i-1}, x_i, x_i') = E_{S_{i+1}^m}\!\left[\phi \mid S_1^{i-1}, S_i = x_i\right] - E_{S_{i+1}^m}\!\left[\phi \mid S_1^{i-1}, S_i = x_i'\right]. \tag{9}$$

Then,

$$g(x_1^{i-1}, x_i, x_i') = \sum_{x_{i+1}^m} \phi(x_1^{i-1}, x_i, x_{i+1}^m) \Pr\left[S_{i+1}^m = x_{i+1}^m \mid S_1^{i-1} = x_1^{i-1}, S_i = x_i\right]$$
$$\qquad - \sum_{{x'}_{i+1}^m} \phi(x_1^{i-1}, x_i', {x'}_{i+1}^m) \Pr\left[S_{i+1}^m = {x'}_{i+1}^m \mid S_1^{i-1} = x_1^{i-1}, S_i = x_i'\right].$$

We show that $g(x_1^{i-1}, x_i, x_i')$ can be bounded by $c_i = \frac{uc}{m + u - i}$ and apply Theorem 3 to obtain the
bound claimed. For uniform sampling without replacement, the probability terms can be written
explicitly:

$$\Pr\left[S_{i+1}^m = x_{i+1}^m \mid S_1^{i-1} = x_1^{i-1}, S_i = x_i\right] = \prod_{k=i}^{m-1} \frac{1}{m + u - k} = \frac{u!}{(m + u - i)!}.$$

Thus,

$$g(x_1^{i-1}, x_i, x_i') = \frac{u!}{(m + u - i)!} \left[ \sum_{x_{i+1}^m} \phi(x_1^{i-1}, x_i, x_{i+1}^m) - \sum_{{x'}_{i+1}^m} \phi(x_1^{i-1}, x_i', {x'}_{i+1}^m) \right].$$

To compute $\sum_{x_{i+1}^m} \phi(x_1^{i-1}, x_i, x_{i+1}^m) - \sum_{{x'}_{i+1}^m} \phi(x_1^{i-1}, x_i', {x'}_{i+1}^m)$, we divide the set of
permutations $\{{x'}_{i+1}^m\}$ into two sets, those that contain the element $x_i$ and those that do not. If a
permutation ${x'}_{i+1}^m$ contains $x_i$ we can write it as ${x'}_{i+1}^{k-1} x_i {x'}_{k+1}^m$, where $k$ is such that $x'_k = x_i$.
We then match the sequence $x_i' {x'}_{i+1}^{k-1} x_i {x'}_{k+1}^m$ with the sequence $x_i {x'}_{i+1}^{k-1} x_i' {x'}_{k+1}^m$ from the set
$\{x_i x_{i+1}^m\}$. These two sequences contain exactly the same elements, and since the function $\phi$ is
symmetric in its arguments, the difference in the value of the function on the permutations is zero.

In the other case, if a permutation ${x'}_{i+1}^m$ does not contain the element $x_i$, then we simply match
it up with the same permutation in $\{x_{i+1}^m\}$. The matching permutations appearing in the summation
are then $x_i {x'}_{i+1}^m$ and $x_i' {x'}_{i+1}^m$, which clearly only differ with respect to $x_i$. The difference in the value
of the function $\phi$ in this case can be bounded by $c$. The number of such permutations can be counted
as follows: it is the number of permutations of length $m - i$ from the set $X$ of $m + u$ elements that
do not contain any of the elements of $x_1^{i-1}$, $x_i$ and $x_i'$, which is equal to $\frac{(m + u - i - 1)!}{(u - 1)!}$. This leads us
to the following upper bound:

$$\left| \sum_{x_{i+1}^m} \phi(x_1^{i-1}, x_i, x_{i+1}^m) - \sum_{{x'}_{i+1}^m} \phi(x_1^{i-1}, x_i', {x'}_{i+1}^m) \right| \leq \frac{(m + u - i - 1)!}{(u - 1)!} \, c, \tag{10}$$

which implies that $|g(x_1^{i-1}, x_i, x_i')| \leq \frac{u!}{(m + u - i)!} \cdot \frac{(m + u - i - 1)!}{(u - 1)!} \, c = \frac{uc}{m + u - i}$. To apply Theorem 3, we
need to bound $\sum_{i=1}^{m} \left(\frac{1}{m + u - i}\right)^2$. To this end, note that

$$\sum_{i=1}^{m} \frac{1}{(m + u - i)^2} = \sum_{j=u}^{m+u-1} \frac{1}{j^2} \leq \int_{u - 1/2}^{m + u - 1/2} \frac{dx}{x^2} = \frac{m}{(m + u - 1/2)(u - 1/2)}. \tag{11}$$
The application of Theorem 3 then yields:

$$\Pr\left[\phi - E[\phi] \geq \epsilon\right] \leq \exp\left(\frac{-2\epsilon^2}{\alpha_u(m, u)\, c^2}\right), \tag{12}$$

where $\alpha_u(m, u) = \dfrac{mu}{m + u - 1/2} \cdot \dfrac{1}{1 - 1/(2u)}$. Function $\phi$ is symmetric in $m$ and $u$ in the sense that
selecting one of the sets uniquely determines the other. The statement of the theorem then follows by
obtaining a similar bound with $\alpha_m(m, u) = \dfrac{mu}{m + u - 1/2} \cdot \dfrac{1}{1 - 1/(2m)}$ and taking the tighter of the two
bounds. ■
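The constants appearing in this proof are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the formula for $\alpha(m, u)$ is the one obtained by combining $c_i = uc/(m + u - i)$ with the integral bound (11) and keeping the tighter of the two one-sided constants.

```python
import math


def alpha(m: int, u: int) -> float:
    """alpha(m, u) = mu / (m + u - 1/2) * 1 / (1 - 1 / (2 max(m, u)))."""
    return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2.0 * max(m, u)))


def sum_inverse_squares(m: int, u: int) -> float:
    """Left-hand side of (11): sum over i = 1..m of 1 / (m + u - i)^2."""
    return sum(1.0 / (m + u - i) ** 2 for i in range(1, m + 1))


def integral_bound(m: int, u: int) -> float:
    """Right-hand side of (11): m / ((m + u - 1/2)(u - 1/2))."""
    return m / ((m + u - 0.5) * (u - 0.5))


def tail_bound(eps: float, c: float, m: int, u: int) -> float:
    """Upper bound (8) on Pr[phi - E[phi] >= eps] for Lipschitz constant c."""
    return math.exp(-2.0 * eps ** 2 / (alpha(m, u) * c ** 2))
```

Multiplying `integral_bound(m, u)` by `u ** 2` gives the one-sided constant $\alpha_u(m, u)$, which is never smaller than `alpha(m, u)`.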



3.2 Transductive stability bound 

Observe that, since the full sample $X$ is given, the average error of a hypothesis $h \in H$ over $X$
defined by $R_X(h) = \frac{1}{m + u} \sum_{i=1}^{m+u} c(h, x_i)$ is not a random variable. Also, for any training sample $S$,
the test error $R(h)$ can be expressed in terms of $R_X(h)$ and the empirical error $\widehat{R}(h)$ as follows:

$$R(h) = \frac{1}{u} \sum_{i=1}^{u} c(h, x_{m+i}) = \frac{1}{u} \left( (m + u) R_X(h) - \sum_{i=1}^{m} c(h, x_i) \right) = \frac{m + u}{u} R_X(h) - \frac{m}{u} \widehat{R}(h). \tag{13}$$

Thus, for a fixed $h$, the quantity $R(h) - \widehat{R}(h) = \frac{m + u}{u} R_X(h) - \frac{m + u}{u} \widehat{R}(h)$ only varies with $\widehat{R}(h)$
and is only a function of the training sample $S$. Let $\Phi$ be defined by $\Phi(S) = R(h) - \widehat{R}(h)$. Since
permuting the points of $S$ does not affect $\widehat{R}(h)$, $\Phi$ is symmetric.
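The decomposition of the test error in terms of the full-sample error and the training error can be verified on synthetic costs. This is a minimal sketch for a fixed hypothesis; the cost values are random placeholders and the helper name `errors` is hypothetical.

```python
import random


def errors(costs, train_idx):
    """Return (training error, test error, full-sample error) for fixed per-point costs."""
    train = set(train_idx)
    m = len(train)
    u = len(costs) - m
    r_hat = sum(costs[i] for i in train) / m
    r_test = sum(c for i, c in enumerate(costs) if i not in train) / u
    r_full = sum(costs) / len(costs)
    return r_hat, r_test, r_full


rng = random.Random(0)
m, u = 5, 7
# Hypothetical costs c(h, x_i) of one fixed hypothesis on the m + u points.
costs = [rng.random() for _ in range(m + u)]
# Training indices drawn uniformly without replacement, as in setting 1.
train_idx = rng.sample(range(m + u), m)

r_hat, r_test, r_full = errors(costs, train_idx)
# Right-hand side of the identity: ((m + u) / u) R_X(h) - (m / u) R_hat(h).
rhs = (m + u) / u * r_full - m / u * r_hat
# The gap between test and training error depends on the sample only through r_hat.
phi = (m + u) / u * (r_full - r_hat)
```

Both `rhs` and `phi` are computed without looking at the test costs, which is what makes the gap a function of the training sample alone.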

To obtain a general transductive regression stability bound, we apply the concentration bound
of Theorem 5 to the random variable $\Phi(S)$. To do so, we need to bound $E_S[\Phi(S)]$, where $S$ is a
random subset of $X$ of size $m$, and $|\Phi(S) - \Phi(S')|$ where $S$ and $S'$ are samples differing by exactly
one point. The following lemma proves a Lipschitz condition for $\Phi$.

Lemma 6 Let $H$ be a hypothesis set bounded by $B$. Let $L$ be a $\beta$-cost-stable algorithm and let $S$
and $S'$ be two training sets of size $m$ that differ in exactly one point. Let $h \in H$ be the hypothesis
returned by $L$ when trained on $S$ and $h' \in H$ the one returned when $L$ is trained on $S'$. Then,

$$|\Phi(S) - \Phi(S')| \leq 2\beta + \frac{B^2 (m + u)}{m u}. \tag{14}$$

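Proof By the definition of $S'$, there exist $i \in [1, m]$ and $j \in [1, u]$ such that $S' = S \setminus \{x_i\} \cup \{x_{m+j}\}$.
$\Phi(S) - \Phi(S')$ can be written as follows:

$$\Phi(S) - \Phi(S') = \frac{1}{u} \sum_{k=1, k \neq j}^{u} \left[ c(h, x_{m+k}) - c(h', x_{m+k}) \right] + \frac{1}{m} \sum_{k=1, k \neq i}^{m} \left[ c(h', x_k) - c(h, x_k) \right]$$
$$\qquad + \frac{1}{u} \left[ c(h, x_{m+j}) - c(h', x_i) \right] + \frac{1}{m} \left[ c(h', x_{m+j}) - c(h, x_i) \right].$$

Since the hypothesis set $H$ is bounded by $B$, the square loss $c$ is bounded by $B^2$, $c(h, x) \leq B^2$ for
all $x \in X$, $h \in H$. Thus,

$$|\Phi(S) - \Phi(S')| \leq \frac{(u - 1)\beta}{u} + \frac{(m - 1)\beta}{m} + \frac{B^2}{u} + \frac{B^2}{m} \leq 2\beta + B^2 \left( \frac{1}{u} + \frac{1}{m} \right). \tag{15}$$

■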
The next lemma bounds the expectation of $\Phi$.

Lemma 7 Let $h$ be the hypothesis returned by a $\beta$-cost-stable algorithm $L$. Then, the following
inequality holds for the expectation of $\Phi$:

$$|E_S[\Phi(S)]| \leq \beta. \tag{16}$$

Proof By the definition of $\Phi(S)$, we can write

$$E_S[\Phi(S)] = E_S[R(h)] - E_S[\widehat{R}(h)] = \frac{1}{u} \sum_{k=1}^{u} E_S\left[c(h, x_{m+k})\right] - \frac{1}{m} \sum_{k=1}^{m} E_S\left[c(h, x_k)\right]. \tag{17}$$

$E_S[c(h, x_{m+k})]$ is the same for all $1 \leq k \leq u$, and similarly, $E_S[c(h, x_k)]$ is the same for all
$1 \leq k \leq m$. Let $i \in [1, m]$ and $j \in [1, u]$, and let $S'$ be defined as in the previous lemma:
$S' = S \setminus \{x_i\} \cup \{x_{m+j}\}$, and let $h'$ denote a hypothesis trained on $S'$. Then the following holds:

$$E_S[\Phi(S)] = E_S\left[c(h, x_{m+j})\right] - E_S\left[c(h, x_i)\right] \tag{18}$$
$$= E_{S'}\left[c(h', x_i)\right] - E_S\left[c(h, x_i)\right] \tag{19}$$
$$= E_{S, S'}\left[c(h', x_i) - c(h, x_i)\right] \leq \beta, \tag{20}$$

by the cost $\beta$-stability of the algorithm. ■



Theorem 8 Let $H$ be a hypothesis set bounded by $B$ and $L$ a $\beta$-cost-stable algorithm. Let $h$ be the
hypothesis returned by $L$ when trained on $X = (S, T)$. Then, for any $\delta > 0$, with probability at least
$1 - \delta$,

$$R(h) \leq \widehat{R}(h) + \beta + \left( 2\beta + \frac{B^2 (m + u)}{m u} \right) \sqrt{\frac{\alpha(m, u) \ln \frac{1}{\delta}}{2}}. \tag{21}$$

Proof The result follows directly from Theorem 5 and Lemmas 6 and 7. ■ 

The bound of Theorem 8 is a general bound that applies to any transductive algorithm. To apply it,
the stability coefficient $\beta$, which depends on $m$ and $u$, needs to be determined. In the subsequent
sections, we derive bounds on $\beta$ for a number of transductive regression algorithms (Cortes and
Mohri, 2007; Belkin et al., 2004a; Wu and Scholkopf, 2007; Zhou et al., 2004; Zhu et al., 2003).
Note that when $\beta = O(1/\min(m, u))$, the slack term of this bound is in $O(1/\sqrt{\min(m, u)})$.
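To see how the bound of Theorem 8 behaves, it can be evaluated for given values of the training error, the stability coefficient and the sample sizes. The sketch below assumes the bound has the form training error plus $\beta$ plus the Lipschitz constant of Lemma 6 times $\sqrt{\alpha(m, u) \ln(1/\delta) / 2}$; the function names and the numeric inputs are hypothetical.

```python
import math


def alpha(m: int, u: int) -> float:
    """alpha(m, u) = mu / (m + u - 1/2) * 1 / (1 - 1 / (2 max(m, u)))."""
    return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2.0 * max(m, u)))


def slack(beta: float, B: float, m: int, u: int, delta: float) -> float:
    """Slack term: (2 beta + B^2 (m + u) / (m u)) * sqrt(alpha ln(1 / delta) / 2)."""
    lipschitz = 2.0 * beta + B ** 2 * (m + u) / (m * u)
    return lipschitz * math.sqrt(alpha(m, u) * math.log(1.0 / delta) / 2.0)


def stability_bound(train_error: float, beta: float, B: float,
                    m: int, u: int, delta: float) -> float:
    """Upper bound on the test error holding with probability at least 1 - delta."""
    return train_error + beta + slack(beta, B, m, u, delta)


# Hypothetical setting with beta = 1 / min(m, u): growing both sample sizes by a
# factor of 100 should shrink the slack by a factor of roughly 10.
small = slack(beta=1.0 / 100, B=1.0, m=100, u=100, delta=0.05)
large = slack(beta=1.0 / 10000, B=1.0, m=10000, u=10000, delta=0.05)
ratio = small / large
```

The ratio close to 10 illustrates the $O(1/\sqrt{\min(m, u)})$ rate of the slack term when $\beta = O(1/\min(m, u))$.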

4. Stability of local transductive regression algorithms 

This section describes and analyzes a general family of local transductive regression algorithms 
(LTR) generalizing the algorithm of Cortes and Mohri (2007). 
4.1 Local transductive regression algorithms

LTR algorithms can be viewed as a generalization of the so-called kernel regularization-based
learning algorithms to the transductive setting. The objective function that is minimized is of the form:

$$F(f, S) = \|f\|_K^2 + \frac{C}{m} \sum_{k=1}^{m} c(f, x_k) + \frac{C'}{u} \sum_{k=1}^{u} \widetilde{c}(f, x_{m+k}), \tag{22}$$

where $\|\cdot\|_K$ is the norm in the reproducing kernel Hilbert space (RKHS) with associated kernel $K$,
$C \geq 0$ and $C' \geq 0$ are trade-off parameters, $f$ is the hypothesis and $\widetilde{c}(f, x) = (f(x) - \widetilde{y}(x))^2$ is the
error of $f$ on the unlabeled point $x$ with respect to a pseudo-target $\widetilde{y}$.

Pseudo-targets are obtained from neighborhood labels $y(x)$ by a local weighted average or other
regression algorithms applied locally. Neighborhoods can be defined as a ball of radius $r$ around
each point in the feature space. We will denote by $\beta_{loc}$ the score-stability coefficient (Definition 2)
of the local algorithm that produces the pseudo-targets.
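One simple way to produce pseudo-targets is to average the labels of the training points falling in the ball of radius $r$ around an unlabeled point. The sketch below is an assumption-laden illustration, not the estimator used in the experiments: it uses uniform weights, Euclidean distance, and a fallback value of 0 when the neighborhood is empty, all of which are choices made here for concreteness.

```python
import math


def pseudo_target(x, labeled, r, default=0.0):
    """Uniformly weighted average of the labels of labeled points within distance r of x.

    `labeled` is a list of (point, label) pairs; `default` is returned when no
    labeled point falls inside the ball (a hypothetical convention).
    """
    neighbor_labels = [y_i for x_i, y_i in labeled if math.dist(x, x_i) <= r]
    if not neighbor_labels:
        return default
    return sum(neighbor_labels) / len(neighbor_labels)


# Hypothetical labeled sample in the plane.
labeled = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((5.0, 5.0), 10.0)]
near = pseudo_target((0.5, 0.0), labeled, r=1.0)   # averages the first two labels
far = pseudo_target((9.0, 9.0), labeled, r=1.0)    # empty neighborhood
```

The radius `r` controls how many labels enter each average, which is the quantity that model selection has to set.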

4.2 Generalization bounds 

In this section, we use the bounded-labels assumption, that is we shall assume that for all $x \in S$,
$|y(x)| \leq M$ for some $M > 0$. We also assume that for any $x \in X$, $K(x, x) \leq \kappa^2$. We will use
the following bound based on the reproducing property and the Cauchy-Schwarz inequality valid
for any hypothesis $h \in H$, and for all $x \in X$,

$$|h(x)| = |\langle h, K(x, \cdot) \rangle| \leq \|h\|_K \sqrt{K(x, x)} \leq \kappa \|h\|_K. \tag{23}$$

Lemma 9 Let $h$ be the hypothesis minimizing (22). Assume that for any $x \in X$, $K(x, x) \leq \kappa^2$.
Then, for any $x \in X$, $|h(x)| \leq \kappa M \sqrt{C + C'}$.

Proof The proof is an adaptation of the technique of Bousquet and Elisseeff (2002) to LTR
algorithms. By Equation 23, $|h(x)| \leq \kappa \|h\|_K$. Let $0 \in \mathbb{R}^{m+u}$ be the hypothesis assigning label zero to
all examples. By the definition of $h$,

$$F(h, S) \leq F(0, S) \leq (C + C') M^2. \tag{24}$$

Using the fact that $\|h\|_K \leq \sqrt{F(h, S)}$ yields the statement of the lemma. ■

Since $|h(x)| \leq \kappa M \sqrt{C + C'}$, this immediately gives us a bound on $|h(x) - y(x)|$:

$$|h(x) - y(x)| \leq M (1 + \kappa \sqrt{C + C'}), \tag{25}$$

and we are in a position to apply Theorem 8 with $B = AM$, $A = 1 + \kappa \sqrt{C + C'}$.

Let $h$ be a hypothesis obtained by training on $S$ and $h'$ by training on $S'$. To determine the
cost-stability coefficient $\beta$, we must upper-bound $|c(h, x) - c(h', x)|$. Let $\Delta h = h - h'$. Then, for
all $x \in X$,

$$|c(h, x) - c(h', x)| = \left| \Delta h(x) \left[ (h(x) - y(x)) + (h'(x) - y(x)) \right] \right| \tag{26}$$
$$\leq 2M (1 + \kappa \sqrt{C + C'}) \, |\Delta h(x)|. \tag{27}$$

As in Inequality 23, for all $x \in X$, $|\Delta h(x)| \leq \kappa \|\Delta h\|_K$, thus for all $x \in X$,

$$|c(h, x) - c(h', x)| \leq 2M (1 + \kappa \sqrt{C + C'}) \, \kappa \|\Delta h\|_K. \tag{28}$$
It remains to bound $\|\Delta h\|_K$. Our approach towards bounding $\|\Delta h\|_K$ is similar to the one used by
Bousquet and Elisseeff (2000), and relies on the convexity of $h \mapsto c(h, x)$. Note however, that in
the case of $\widetilde{c}$, the pseudo-targets may depend on the training set $S$. This dependency matters when
we wish to apply convexity of two hypotheses $h$ and $h'$ obtained by training on different samples $S$
and $S'$. For convenience, for any two such fixed hypotheses $h$ and $h'$, we extend the definition of $\widetilde{c}$
as follows. For all $t \in [0, 1]$,

$$\widetilde{c}(th + (1 - t)h', x) = \left( (th + (1 - t)h')(x) - (t\widetilde{y} + (1 - t)\widetilde{y}') \right)^2. \tag{29}$$

This allows us to use the same convexity property for $\widetilde{c}$ as for $c$ for any two fixed hypotheses $h$ and
$h'$ as verified by the following lemma.

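Lemma 10 Let $h$ be a hypothesis obtained by training on $S$ and $h'$ by training on $S'$. Then, for all
$t \in [0, 1]$,

$$t\,\widetilde{c}(h, x) + (1 - t)\,\widetilde{c}(h', x) \geq \widetilde{c}(th + (1 - t)h', x). \tag{30}$$

Proof Let $\widetilde{y} = \widetilde{y}(x)$ be the pseudo-target value at $x$ when the training set is $S$ and $\widetilde{y}' = \widetilde{y}'(x)$ when
the training set is $S'$. For all $t \in [0, 1]$,

$$t\,\widetilde{c}(h, x) + (1 - t)\,\widetilde{c}(h', x) - \widetilde{c}(th + (1 - t)h', x)$$
$$= t(h(x) - \widetilde{y})^2 + (1 - t)(h'(x) - \widetilde{y}')^2 - \left[ (th(x) + (1 - t)h'(x)) - (t\widetilde{y} + (1 - t)\widetilde{y}') \right]^2$$
$$= t(h(x) - \widetilde{y})^2 + (1 - t)(h'(x) - \widetilde{y}')^2 - \left[ t(h(x) - \widetilde{y}) + (1 - t)(h'(x) - \widetilde{y}') \right]^2.$$

The statement of the lemma follows directly by the convexity of the function $x \mapsto x^2$ defined over
$\mathbb{R}$. ■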

Recall that $\beta_{loc}$ denotes the score-stability of the algorithm that produces the pseudo-targets. In
Lemma 12 we present an upper bound on $\|\Delta h\|_K$ which can then be plugged into Equation 28 to
determine the stability of LTR.

Lemma 11 Assume that for all $x \in X$, $|y(x)| \leq M$. Let $S$ and $S'$ be two samples differing
by exactly one point. Let $h$ be the hypothesis returned by the algorithm minimizing the objective
function $F(f, S)$, $h'$ be the hypothesis obtained by minimization of $F(f, S')$ and let $\widetilde{y}$ and $\widetilde{y}'$ be the
corresponding pseudo-targets. Then for all $i \in [1, m + u]$,

$$\frac{C}{m} \left[ c(h', x_i) - c(h, x_i) \right] + \frac{C'}{u} \left[ \widetilde{c}(h', x_i) - \widetilde{c}(h, x_i) \right] \leq 2AM \left( \kappa \|\Delta h\|_K \left( \frac{C}{m} + \frac{C'}{u} \right) + \beta_{loc} \frac{C'}{u} \right), \tag{31}$$

where $\Delta h = h' - h$ and $A = 1 + \kappa \sqrt{C + C'}$.

Proof From Equation 28, we know that:

$$|c(h', x_i) - c(h, x_i)| \leq 2M (1 + \kappa \sqrt{C + C'}) \, \kappa \|\Delta h\|_K. \tag{32}$$
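It remains to bound $|\widetilde{c}(h', x_i) - \widetilde{c}(h, x_i)|$.

$$\widetilde{c}(h', x_i) - \widetilde{c}(h, x_i) = (h'(x_i) - \widetilde{y}'(x_i))^2 - (h(x_i) - \widetilde{y}(x_i))^2$$
$$= \left[ (h'(x_i) - \widetilde{y}'(x_i)) + (h(x_i) - \widetilde{y}(x_i)) \right] \left( \Delta h(x_i) - (\widetilde{y}'(x_i) - \widetilde{y}(x_i)) \right)$$
$$\leq 2M (1 + \kappa \sqrt{C + C'}) (\kappa \|\Delta h\|_K + \beta_{loc}).$$

Here, we are using the score-stability $\beta_{loc}$ of the local algorithm in $|\widetilde{y}'(x) - \widetilde{y}(x)| \leq \beta_{loc}$ and the fact that
$|h(x) - \widetilde{y}(x)| \leq M (1 + \kappa \sqrt{C + C'})$ when $|\widetilde{y}(x)| \leq M$ (by Lemma 9).

Plugging the bounds for $|c(h', x_i) - c(h, x_i)|$ and $|\widetilde{c}(h', x_i) - \widetilde{c}(h, x_i)|$ into the left-hand side of
Equation 31 yields the statement of the lemma. ■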



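Lemma 12 Assume that for all $x \in X$, $|y(x)| \leq M$. Let $S$ and $S'$ be two samples differing
by exactly one point. Let $h$ be the hypothesis returned by the algorithm minimizing the objective
function $F(f, S)$, $h'$ the hypothesis obtained by minimization of $F(f, S')$ and let $\widetilde{y}$ and $\widetilde{y}'$ be the
corresponding pseudo-targets. Then

$$\|\Delta h\|_K^2 \leq 2AM \left( \kappa \|\Delta h\|_K \left( \frac{C}{m} + \frac{C'}{u} \right) + \beta_{loc} \frac{C'}{u} \right), \tag{33}$$

where $\Delta h = h' - h$ and $A = 1 + \kappa \sqrt{C + C'}$.

Proof By the definition of $h$ and $h'$, we have

$$h = \operatorname*{argmin}_{f \in H} F(f, S) \quad \text{and} \quad h' = \operatorname*{argmin}_{f \in H} F(f, S').$$

Let $t \in [0, 1]$. Then $h + t\Delta h$ and $h' - t\Delta h$ satisfy:

$$F(h, S) - F(h + t\Delta h, S) \leq 0 \tag{34}$$
$$F(h', S') - F(h' - t\Delta h, S') \leq 0. \tag{35}$$

For notational ease, let $h_{t\Delta}$ denote $h + t\Delta h$ and $h'_{t\Delta}$ denote $h' - t\Delta h$. Summing the two inequalities
in Equations 34 and 35 yields:

$$\frac{C}{m} \sum_{k=1}^{m} \left[ c(h, x_k) - c(h_{t\Delta}, x_k) \right] + \frac{C'}{u} \sum_{k=1}^{u} \left[ \widetilde{c}(h, x_{m+k}) - \widetilde{c}(h_{t\Delta}, x_{m+k}) \right] +$$
$$\frac{C}{m} \sum_{k=1, k \neq i}^{m} \left[ c(h', x_k) - c(h'_{t\Delta}, x_k) \right] + \frac{C'}{u} \sum_{k=1, k \neq j}^{u} \left[ \widetilde{c}(h', x_{m+k}) - \widetilde{c}(h'_{t\Delta}, x_{m+k}) \right] +$$
$$\frac{C}{m} \left[ c(h', x_{m+j}) - c(h'_{t\Delta}, x_{m+j}) \right] + \frac{C'}{u} \left[ \widetilde{c}(h', x_i) - \widetilde{c}(h'_{t\Delta}, x_i) \right] +$$
$$\|h\|_K^2 - \|h_{t\Delta}\|_K^2 + \|h'\|_K^2 - \|h'_{t\Delta}\|_K^2 \leq 0.$$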
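By the convexity of $c(h, \cdot)$ in $h$, it follows that for all $k \in [1, m + u]$

$$c(h, x_k) - c(h_{t\Delta}, x_k) \geq t \left[ c(h, x_k) - c(h + \Delta h, x_k) \right], \tag{36}$$

and

$$c(h', x_k) - c(h'_{t\Delta}, x_k) \geq t \left[ c(h', x_k) - c(h' - \Delta h, x_k) \right]. \tag{37}$$

By Lemma 10, similar inequalities hold for $\widetilde{c}$. Since $h + \Delta h = h'$ and $h' - \Delta h = h$, these
observations lead to:

$$\frac{Ct}{m} \sum_{k=1}^{m} \left[ c(h, x_k) - c(h', x_k) \right] + \frac{C't}{u} \sum_{k=1}^{u} \left[ \widetilde{c}(h, x_{m+k}) - \widetilde{c}(h', x_{m+k}) \right] +$$
$$\frac{Ct}{m} \sum_{k=1, k \neq i}^{m} \left[ c(h', x_k) - c(h, x_k) \right] + \frac{C't}{u} \sum_{k=1, k \neq j}^{u} \left[ \widetilde{c}(h', x_{m+k}) - \widetilde{c}(h, x_{m+k}) \right] +$$
$$\frac{Ct}{m} \left[ c(h', x_{m+j}) - c(h, x_{m+j}) \right] + \frac{C't}{u} \left[ \widetilde{c}(h', x_i) - \widetilde{c}(h, x_i) \right] +$$
$$\|h\|_K^2 - \|h_{t\Delta}\|_K^2 + \|h'\|_K^2 - \|h'_{t\Delta}\|_K^2 \leq 0.$$

Let $E$ denote $\|h\|_K^2 - \|h_{t\Delta}\|_K^2 + \|h'\|_K^2 - \|h'_{t\Delta}\|_K^2$. Simplifying the previous inequality leads to:

$$E \leq \frac{Ct}{m} \left[ c(h', x_i) - c(h, x_i) + c(h, x_{m+j}) - c(h', x_{m+j}) \right] - \frac{C't}{u} \left[ \widetilde{c}(h', x_i) - \widetilde{c}(h, x_i) + \widetilde{c}(h, x_{m+j}) - \widetilde{c}(h', x_{m+j}) \right].$$

Let $A = 1 + \kappa \sqrt{C + C'}$. Using Lemma 11 twice (with $x_i$ and $x_{m+j}$), the expression above can be
bounded by

$$E \leq 4AMt \left( \kappa \|\Delta h\|_K \left( \frac{C}{m} + \frac{C'}{u} \right) + \beta_{loc} \frac{C'}{u} \right). \tag{38}$$

Finally, since $\|h\|_K^2 = \langle h, h \rangle_K$ for any $h \in H$, it is not hard to show that:

$$\|h\|_K^2 - \|h + t\Delta h\|_K^2 + \|h'\|_K^2 - \|h' - t\Delta h\|_K^2 = 2t \|\Delta h\|_K^2 (1 - t). \tag{39}$$

Using Equation 39 in Equation 38, it follows that:

$$\|\Delta h\|_K^2 (1 - t) \leq 2AM \left( \kappa \|\Delta h\|_K \left( \frac{C}{m} + \frac{C'}{u} \right) + \beta_{loc} \frac{C'}{u} \right). \tag{40}$$

Taking the limit as $t \to 0$ yields the statement of the lemma. ■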

The following is the main result of this section, a stability-based generalization bound for LTR. 

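Theorem 13 Assume that for all $x \in X$, $|y(x)| \leq M$ and there exists $\kappa$ such that for all $x \in X$,
$K(x, x) \leq \kappa^2$. Further, assume that the local estimator has score-stability $\beta_{loc}$. Let
$A = 1 + \kappa \sqrt{C + C'}$. Then, LTR is uniformly $\beta$-cost-stable with

$$\beta \leq 2 (AM)^2 \kappa^2 \left[ \frac{C}{m} + \frac{C'}{u} + \sqrt{\left( \frac{C}{m} + \frac{C'}{u} \right)^2 + \frac{2 C' \beta_{loc}}{A M \kappa^2 u}} \right].$$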
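Proof From Lemma 12, we know that

$$\|\Delta h\|_K^2 \leq 2AM \left( \kappa \|\Delta h\|_K \left( \frac{C}{m} + \frac{C'}{u} \right) + \beta_{loc} \frac{C'}{u} \right), \tag{41}$$

where $\Delta h = h' - h$ and $A = 1 + \kappa \sqrt{C + C'}$. This implies that $\|\Delta h\|_K$ is bounded by the non-negative
root of the second-degree polynomial, which gives

$$\|\Delta h\|_K \leq AM\kappa \left[ \frac{C}{m} + \frac{C'}{u} + \sqrt{\left( \frac{C}{m} + \frac{C'}{u} \right)^2 + \frac{2 C' \beta_{loc}}{A M \kappa^2 u}} \right]. \tag{42}$$

Using the above bound on $\|\Delta h\|_K$ in Equation 28 yields the desired bound on the stability
coefficient of LTR and completes the proof. ■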
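The stability coefficient of Theorem 13 is a closed-form expression of the trade-off parameters, so it can be tabulated when comparing candidate settings. The following sketch assumes the bound is the non-negative root of the quadratic inequality of Lemma 12 multiplied by $2AM\kappa$; the function names and the numeric values are hypothetical.

```python
import math


def delta_h_norm_bound(C, C_prime, M, kappa, beta_loc, m, u):
    """Non-negative root z of z^2 = 2AM (kappa z (C/m + C'/u) + beta_loc C'/u)."""
    A = 1.0 + kappa * math.sqrt(C + C_prime)
    a = C / m + C_prime / u
    root = math.sqrt(a ** 2 + 2.0 * C_prime * beta_loc / (A * M * kappa ** 2 * u))
    return A * M * kappa * (a + root)


def ltr_cost_stability(C, C_prime, M, kappa, beta_loc, m, u):
    """Cost-stability bound: beta <= 2 A M kappa * (bound on the RKHS norm of Delta h)."""
    A = 1.0 + kappa * math.sqrt(C + C_prime)
    return 2.0 * A * M * kappa * delta_h_norm_bound(C, C_prime, M, kappa, beta_loc, m, u)


# Hypothetical parameter values.
params = dict(C=1.0, C_prime=0.5, M=1.0, kappa=1.0, m=200, u=1000)
beta_stable_local = ltr_cost_stability(beta_loc=0.0, **params)
beta_unstable_local = ltr_cost_stability(beta_loc=0.1, **params)
```

A less stable local estimator (larger `beta_loc`) yields a larger coefficient, which is how the choice of neighborhood radius enters the bound.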

Our experiments with local transductive regression in Section 7 will show the benefit of this bound 
for model selection. 

5. Stability of unconstrained regularization algorithms 

5.1 Unconstrained regularization algorithms 

In this section, we consider a family of transductive regression algorithms that can be formulated as 
the following optimization problem: 

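$$\min_{\mathbf{h}} \; \mathbf{h}^\top \mathbf{Q} \mathbf{h} + (\mathbf{h} - \mathbf{y})^\top \mathbf{C} (\mathbf{h} - \mathbf{y}), \tag{43}$$

where $\mathbf{Q} \in \mathbb{R}^{(m+u) \times (m+u)}$ is a symmetric regularization matrix, $\mathbf{C} \in \mathbb{R}^{(m+u) \times (m+u)}$ a symmetric
matrix of empirical weights (in practice it is often a diagonal matrix), $\mathbf{y} \in \mathbb{R}^{(m+u) \times 1}$ the target
values of the $m$ labeled points together with the pseudo-target values of the $u$ unlabeled points (in
some formulations, the pseudo-target value is 0), and $\mathbf{h} \in \mathbb{R}^{(m+u) \times 1}$ a column matrix whose $i$th
row is the predicted target value for the point $x_i$. The closed-form solution of (43) is given by

$$\mathbf{h} = (\mathbf{C}^{-1} \mathbf{Q} + \mathbf{I})^{-1} \mathbf{y}. \tag{44}$$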

The formulation (43) is quite general and includes as special cases the algorithms of Belkin et al. 
(2004a); Wu and Scholkopf (2007); Zhou et al. (2004); Zhu et al. (2003). We present a general 
framework for bounding the stability coefficient of these algorithms and then examine the stability 
coefficient of each of these algorithms in turn. 
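The closed-form solution can be checked on a toy instance by solving the equivalent linear system $(Q + C)h = Cy$, which follows from setting the gradient of (43) to zero. The sketch below uses a small hand-written Gaussian elimination so that it runs without external libraries; the matrices and targets are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def objective(h, Q, C, y):
    """h^T Q h + (h - y)^T C (h - y)."""
    n = len(h)
    d = [h[i] - y[i] for i in range(n)]
    reg = sum(h[i] * Q[i][j] * h[j] for i in range(n) for j in range(n))
    fit = sum(d[i] * C[i][j] * d[j] for i in range(n) for j in range(n))
    return reg + fit


def closed_form(Q, C, y):
    """Solve (Q + C) h = C y, equivalent to h = (C^{-1} Q + I)^{-1} y for invertible C."""
    n = len(y)
    lhs = [[Q[i][j] + C[i][j] for j in range(n)] for i in range(n)]
    rhs = [sum(C[i][j] * y[j] for j in range(n)) for i in range(n)]
    return solve(lhs, rhs)


# Hypothetical instance: Laplacian of a 3-node path graph and diagonal weights.
Q = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
C = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, -1.0, 0.0]
h = closed_form(Q, C, y)
```

Because $Q$ is positive semi-definite and $C$ positive definite in this instance, the objective is strictly convex and any perturbation of `h` increases it.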

5.2 Score-based stability analysis 

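For a symmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ we denote by $\lambda_M(\mathbf{A})$ its largest and by $\lambda_m(\mathbf{A})$ its smallest
eigenvalue. Thus, for any $\mathbf{v} \in \mathbb{R}^{n \times 1}$, $\lambda_m(\mathbf{A}) \|\mathbf{v}\|_2 \leq \|\mathbf{A}\mathbf{v}\|_2 \leq \lambda_M(\mathbf{A}) \|\mathbf{v}\|_2$. We will also
use, in the proof of the following theorem, the fact that for symmetric matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$,
$\lambda_M(\mathbf{A}\mathbf{B}) \leq \lambda_M(\mathbf{A}) \lambda_M(\mathbf{B})$.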

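Theorem 14 Let $\mathbf{h}^*$ and $\mathbf{h}'^*$ solve (43), under test and training sets that differ exactly in one point
and let $\mathbf{C}$, $\mathbf{C}'$, $\mathbf{y}$, $\mathbf{y}'$ be the corresponding empirical weight and the target value matrices. Then,

$$\|\mathbf{h}^* - \mathbf{h}'^*\|_\infty \leq \|\mathbf{h}^* - \mathbf{h}'^*\|_2 \leq \frac{\|\mathbf{y} - \mathbf{y}'\|_2}{\frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C})} + 1} + \frac{\lambda_M(\mathbf{Q}) \, \|\mathbf{C}'^{-1} - \mathbf{C}^{-1}\|_2 \, \|\mathbf{y}'\|_2}{\left( \frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C}')} + 1 \right) \left( \frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C})} + 1 \right)}. \tag{45}$$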
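Proof The first inequality holds as a result of the general relation between norm-infinity and
norm-2. Let $\Delta \mathbf{h}^* = \mathbf{h}^* - \mathbf{h}'^*$ and $\Delta \mathbf{y} = \mathbf{y} - \mathbf{y}'$. By definition,

$$\Delta \mathbf{h}^* = (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} \mathbf{y} - (\mathbf{C}'^{-1}\mathbf{Q} + \mathbf{I})^{-1} \mathbf{y}' \tag{46}$$
$$= (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} \Delta \mathbf{y} + \left( (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} - (\mathbf{C}'^{-1}\mathbf{Q} + \mathbf{I})^{-1} \right) \mathbf{y}' \tag{47}$$
$$= (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} \Delta \mathbf{y} + \left[ (\mathbf{C}'^{-1}\mathbf{Q} + \mathbf{I})^{-1} \left[ (\mathbf{C}'^{-1} - \mathbf{C}^{-1}) \mathbf{Q} \right] (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} \right] \mathbf{y}'. \tag{48}$$

Since $\|(\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1}\|_2 = \lambda_M\left( (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1} \right) = \lambda_m^{-1}(\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})$, and $\lambda_m(\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I}) \geq \frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C})} + 1$,
$\|\Delta \mathbf{h}^*\|_2$ can be bounded as follows:

$$\|\Delta \mathbf{h}^*\|_2 \leq \frac{\|\Delta \mathbf{y}\|_2}{\lambda_m(\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})} + \frac{\lambda_M(\mathbf{Q}) \, \|\mathbf{C}'^{-1} - \mathbf{C}^{-1}\|_2 \, \|\mathbf{y}'\|_2}{\lambda_m(\mathbf{C}'^{-1}\mathbf{Q} + \mathbf{I}) \, \lambda_m(\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})}$$
$$\leq \frac{\|\Delta \mathbf{y}\|_2}{\frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C})} + 1} + \frac{\lambda_M(\mathbf{Q}) \, \|\mathbf{C}'^{-1} - \mathbf{C}^{-1}\|_2 \, \|\mathbf{y}'\|_2}{\left( \frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C}')} + 1 \right) \left( \frac{\lambda_m(\mathbf{Q})}{\lambda_M(\mathbf{C})} + 1 \right)}.$$

This proves the second inequality. ■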

The theorem helps derive score-stability bounds for various transductive regression algorithms
(Zhou et al., 2004; Wu and Scholkopf, 2007; Zhu et al., 2003) based on the closed-form solution for
the hypothesis. Recall that score-stability (Definition 2) is the maximum change in the hypothesis
score on any point $x$ as the learning algorithm is trained on two training sets that differ in exactly
one point, that is precisely an upper bound on $\|\mathbf{h}^* - \mathbf{h}'^*\|_\infty$.



5.3 Application 

For each of the algorithms in (Zhou et al., 2004; Wu and Schölkopf, 2007; Zhu et al., 2003), an
estimate of 0 is used for unlabeled points. Thus, the vector $\mathbf{y}$ has the following structure: the entries
corresponding to training examples are their true labels and those corresponding to the unlabeled
examples are 0.

For each one of the three algorithms, we make the bounded labels assumption (for all $x \in X$,
$|y(x)| \le M$ for some $M > 0$). It is then not difficult to show that $\|\mathbf{y} - \mathbf{y}'\|_2 \le \sqrt{2}M$ and
$\|\mathbf{y}'\|_2 \le \sqrt{m}M$. Furthermore, all the stability bounds derived are based on the notion of
score-stability (Definition 2).
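The two norm bounds can be checked on a small example. The sketch below is only an illustration: the
labels, the index sets and the helper names `label_vector` and `l2_norm` are hypothetical. Swapping one
labeled point with one unlabeled point changes exactly two coordinates of the target vector, each by at
most $M$ in absolute value, which gives the factor $\sqrt{2}$.

```python
import math

def label_vector(labels, labeled_idx):
    """Target vector y: the true label on labeled points, 0 on unlabeled points."""
    return [labels[i] if i in labeled_idx else 0.0 for i in range(len(labels))]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

M = 3.0                                      # assumed label bound, |y(x)| <= M
labels = [3.0, -3.0, 1.5, -2.0, 3.0, -3.0]   # hypothetical labels of all m + u points
S = {0, 1, 2}                                # labeled indices, m = 3
S_prime = {0, 1, 5}                          # labeled point 2 swapped with unlabeled point 5

y = label_vector(labels, S)
y_prime = label_vector(labels, S_prime)
diff_norm = l2_norm([a - b for a, b in zip(y, y_prime)])

# Only coordinates 2 and 5 differ between y and y_prime.
print("||y - y'||_2 =", diff_norm, " bound:", math.sqrt(2) * M)
print("||y'||_2     =", l2_norm(y_prime), " bound:", math.sqrt(len(S_prime)) * M)
```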

5.3.1 Consistency method (CM) 

In the CM algorithm (Zhou et al., 2004), the matrix $\mathbf{Q}$ is a normalized Laplacian of a weight matrix
$\mathbf{W} \in \mathbb{R}^{(m+u)\times(m+u)}$ that captures affinity between pairs of points in the full sample $X$. Thus,
$\mathbf{Q} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$, where $\mathbf{D} \in \mathbb{R}^{(m+u)\times(m+u)}$ is a diagonal matrix, with
$[\mathbf{D}]_{i,i} = \sum_j [\mathbf{W}]_{i,j}$. Note that $\lambda_m(\mathbf{Q}) = 0$. Furthermore, matrices $\mathbf{C}$ and $\mathbf{C}'$ are identical in
CM, both diagonal matrices with $(i, i)$th entry equal to a positive constant $\mu > 0$. Thus
$\mathbf{C}^{-1} = \mathbf{C}'^{-1}$ and, using Proposition 14,
we obtain the following bound on the score-stability of the CM algorithm: $\beta_{CM} \le \sqrt{2}M$.
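The fact that the normalized Laplacian has smallest eigenvalue 0 can be seen numerically: the vector with
entries $\sqrt{[\mathbf{D}]_{i,i}}$ is mapped to zero. The following sketch, with a hypothetical weight matrix and
a hypothetical helper `normalized_laplacian`, illustrates this; it assumes every vertex has positive degree.

```python
import math

def normalized_laplacian(W):
    """Q = I - D^{-1/2} W D^{-1/2} for a symmetric non-negative weight matrix W."""
    n = len(W)
    d = [sum(row) for row in W]          # degrees, assumed strictly positive
    return [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)] for i in range(n)]

# Hypothetical symmetric weight matrix with zero diagonal.
W = [[0, 1, 2, 0],
     [1, 0, 1, 1],
     [2, 1, 0, 3],
     [0, 1, 3, 0]]
Q = normalized_laplacian(W)

# v = D^{1/2} 1 lies in the null space of Q, so 0 is an eigenvalue of Q.
v = [math.sqrt(sum(row)) for row in W]
Qv = [sum(Q[i][j] * v[j] for j in range(len(W))) for i in range(len(W))]
print(max(abs(x) for x in Qv))   # numerically zero
```

Since $\mathbf{Q}$ is positive semi-definite, this null vector shows that the first denominator in (45) reduces to 1 for CM.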

5.3.2 Local learning regularization (LL-Reg)

In the LL-Reg algorithm (Wu and Schölkopf, 2007), the regularization matrix $\mathbf{Q}$ is
$(\mathbf{I} - \mathbf{A})^\top(\mathbf{I} - \mathbf{A})$, where $\mathbf{I} \in \mathbb{R}^{(m+u)\times(m+u)}$ is an identity matrix and
$\mathbf{A} \in \mathbb{R}^{(m+u)\times(m+u)}$ is a non-negative
weight matrix that captures the local similarity between all pairs of points in $X$. $\mathbf{A}$ is normalized,
i.e. each of its rows sums to 1. Let $C_l, C_u > 0$ be two positive constants. The matrix $\mathbf{C}$ is a diagonal
matrix with $[\mathbf{C}]_{i,i} = C_l$ if $x_i \in S$ and $C_u$ otherwise. Let $C_{\max} = \max\{C_l, C_u\}$ and
$C_{\min} = \min\{C_l, C_u\}$. Thus, $\|\mathbf{C}'^{-1} - \mathbf{C}^{-1}\|_2 = \sqrt{2}\left(\frac{1}{C_{\min}} - \frac{1}{C_{\max}}\right)$.
By the Perron-Frobenius theorem, the eigenvalues of $\mathbf{A}$ lie in the interval $(-1, 1]$ and $\lambda_M(\mathbf{A}) \le 1$.
Thus, $\lambda_m(\mathbf{Q}) \ge 0$ and $\lambda_M(\mathbf{Q}) \le 4$
and we have the following bound on the score-stability of the LL-Reg algorithm:

\[
\beta_{LL\text{-}Reg} \le \sqrt{2}M + 4\sqrt{2m}\,M\left(\frac{1}{C_{\min}} - \frac{1}{C_{\max}}\right)
\le \sqrt{2}M + \frac{4\sqrt{2m}\,M}{C_{\min}}.
\]


5.3.3 Gaussian Mean Fields algorithm 



GMF (Zhu et al., 2003) is very similar to LL-Reg, and admits exactly the same stability coefficient.

Thus, using our bounding technique, the stability coefficients of the CM, LL-Reg, and GMF algorithms
can be large. Without additional constraints on the matrix $\mathbf{Q}$, these algorithms do not seem
to be stable enough for the generalization bound of Theorem 8 to converge. The next section in fact
demonstrates this by presenting a constant lower bound on their score-stability.

5.4 Lower bound on stability coefficient 

The stability coefficient is a function of the sample size. For stability learning bounds to converge, 
it must go to zero as a function of the sample size. The following theorem proves that the stability 
coefficient of the CM algorithm is lower-bounded by a constant for some problems. A similar lower 
bound can be given for the other two algorithms examined. 

Theorem 15 There exists a transductive regression problem with $m \ge 2$ labeled samples and $m$
unlabeled samples and a diagonal matrix $\mathbf{C}$ for which the score-stability $\beta$ of the CM algorithm
admits the following lower bound:

\[
\beta \ge \frac{C}{2(C + 2)}. \tag{51}
\]

Proof Consider a transductive regression problem with $2m$ instances where $m$ instances have a target
value of 0 and the other $m$ instances a target value of 1. Let the labeled sample $S$ include exactly
the instances $x_1, \ldots, x_m$ with target value 0 and $U$ be defined by the complement $x_{m+1}, \ldots, x_{2m}$.
Let $\mathbf{L}$ denote an $m \times m$ normalized graph Laplacian matrix, with 1s along the diagonal and all
off-diagonal terms equal to $-\frac{1}{m-1}$. Then the matrix $\mathbf{Q}$ is defined with the following block structure:

\[
\mathbf{Q} = \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{0} & \mathbf{L} \end{pmatrix}. \tag{52}
\]

In our example, we set $\mathbf{C}$ to be a diagonal matrix with all its entries equal to the constant $C$. The
matrix $\mathbf{M} = \mathbf{C}^{-1}\mathbf{Q} + \mathbf{I}$ has the following block structure:

\[
\mathbf{M} = \begin{pmatrix} \mathbf{N} & \mathbf{0} \\ \mathbf{0} & \mathbf{N} \end{pmatrix}, \tag{53}
\]

where $\mathbf{N}$ is the $m \times m$ matrix whose diagonal entries are all equal to $1 + \frac{1}{C}$ and whose off-diagonal
entries are all equal to $-\frac{1}{C(m-1)}$.

Now, consider the training sample $S'$ obtained from $S$ by swapping a labeled point with an
unlabeled point. For the sake of convenience, let the index of this point be $m$. The $\mathbf{y}$ vector changes
(to $\mathbf{y}'$) only in the $m$th position. Thus, all the entries of $\Delta\mathbf{y} = \mathbf{y} - \mathbf{y}'$ are zero except for its $m$th
entry, which equals 1. By Equation 48, $\Delta\mathbf{h}^* = \mathbf{M}^{-1}\Delta\mathbf{y}$, thus $\Delta\mathbf{h}^*$ is exactly the $m$th column of
$\mathbf{M}^{-1}$. Let $[\mathbf{M}^{-1}]_{m,m} = [\mathbf{N}^{-1}]_{m,m}$ denote the $(m, m)$ entry of $\mathbf{M}^{-1}$, which coincides with the
$(m, m)$ entry of $\mathbf{N}^{-1}$. Since $\|\Delta\mathbf{h}^*\|_2 \ge |[\mathbf{M}^{-1}]_{m,m}|$, to give a lower bound on $\|\Delta\mathbf{h}^*\|_2$, it suffices
to lower bound $|[\mathbf{N}^{-1}]_{m,m}|$. To do so, we can compute $[\mathbf{N}^{-1}]_{m,m}$.

By symmetry, the diagonal entries of $\mathbf{N}^{-1}$ are all equal to some value $a$, thus $ma = \mathrm{Tr}(\mathbf{N}^{-1})$,
which can be computed from the inverses of the eigenvalues of $\mathbf{N}$. Observe that
$\mathbf{N}_0 = \mathbf{N} - \left(1 + \frac{1}{C} + \frac{1}{C(m-1)}\right)\mathbf{I}$ is a matrix with all entries equal to $-\frac{1}{C(m-1)}$. Thus, it is a
rank one matrix and its only non-zero eigenvalue coincides with its trace: $\mathrm{Tr}(\mathbf{N}_0) = -\frac{m}{C(m-1)}$. Since
$1 + \frac{1}{C} + \frac{1}{C(m-1)} = \frac{C(m-1) + m}{C(m-1)}$, this shows that the eigenvalues of $\mathbf{N}$ are
$-\frac{m}{C(m-1)} + \frac{C(m-1) + m}{C(m-1)} = 1$ with multiplicity 1,
and $\frac{C(m-1) + m}{C(m-1)}$ with multiplicity $m - 1$. Thus, $ma = \mathrm{Tr}(\mathbf{N}^{-1}) = 1 + \frac{(m-1)^2 C}{(m-1)C + m}$, which gives

\[
a = \frac{1}{m} + \frac{m-1}{m} \cdot \frac{C}{C + \frac{m}{m-1}} \ge \frac{C}{2(C + 2)}, \tag{54}
\]

since for $m \ge 2$, $\frac{1}{2} \le \frac{m-1}{m} < 1$. ■
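The value of the diagonal entry $a = [\mathbf{N}^{-1}]_{m,m}$ can be checked numerically. The sketch below builds
$\mathbf{N}$ (diagonal entries $1 + 1/C$, off-diagonal entries $-1/(C(m-1))$), solves a linear system to read off the
entry, and compares it with the closed form $a = \frac{1}{m} + \frac{(m-1)^2 C}{m((m-1)C + m)}$ obtained from the trace
argument. The helper names are hypothetical, and the lower bound $C/(2(C+2))$ used in the check is the
elementary one that follows from $m/(m-1) \le 2$; it is an assumption of this sketch rather than a
quotation of the theorem's constant.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def diag_entry_of_N_inverse(m, C):
    """(m, m) entry of N^{-1}, computed by solving N x = e_m."""
    off = -1.0 / (C * (m - 1))
    N = [[1.0 + 1.0 / C if i == j else off for j in range(m)] for i in range(m)]
    e_m = [0.0] * (m - 1) + [1.0]
    return solve(N, e_m)[m - 1]

def closed_form(m, C):
    """Closed form of the same entry, from the eigenvalues of N."""
    return 1.0 / m + (m - 1) ** 2 * C / (m * ((m - 1) * C + m))

for m in (2, 5, 10):
    for C in (0.1, 1.0, 10.0):
        a = diag_entry_of_N_inverse(m, C)
        print(m, C, a, closed_form(m, C), C / (2 * (C + 2)))
```

For fixed $C$ the entry does not shrink as $m$ grows, which is the point of the lower bound.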



An example of a constraint that can help guarantee stability is the condition $\sum_{i=1}^{m+u} h(x_i) = 0$ used
in the algorithm of Belkin et al. (2004a). In the next section, we give a generalization bound for a
family of algorithms based on this constraint and then describe a general method for making
the algorithms just examined stable.



6. Stability of constrained regularization algorithms 
6.1 Constrained graph regularization algorithms 

Here, we examine constrained regularization algorithms such as the graph Laplacian regularization 
algorithm of Belkin et al. (2004a). Given a weighted graph G = (X, E) in which edge weights can 
be interpreted as similarities between vertices, the task consists of predicting the vertex labels. The 
input space $X$ is thus reduced to the set of vertices, and a hypothesis $h : X \to \mathbb{R}$ can be identified
with the finite-dimensional vector $\mathbf{h}$ of its predictions $\mathbf{h} = [h(x_1), \ldots, h(x_{m+u})]^\top$. The hypothesis
set $H$ can thus be identified with $\mathbb{R}^{m+u}$ here. Let $\mathbf{h}_S$ denote the restriction of $\mathbf{h}$ to the training
points, $[h(x_1), \ldots, h(x_m)]^\top \in \mathbb{R}^m$, and similarly let $\mathbf{y}_S$ denote $[y_1, \ldots, y_m]^\top \in \mathbb{R}^m$.

The general family of constrained graph regularization algorithms can then be defined by the 
following optimization problem: 

\[
\min_{\mathbf{h} \in H}\ \mathbf{h}^\top\mathbf{L}\mathbf{h} + \frac{C}{m}(\mathbf{h}_S - \mathbf{y}_S)^\top(\mathbf{h}_S - \mathbf{y}_S) \tag{55}
\]

subject to: $\mathbf{h}^\top\mathbf{u} = 0$,

where $\mathbf{L} \in \mathbb{R}^{(m+u)\times(m+u)}$ is a positive semi-definite symmetric matrix, $y_i$, $i \in [1, m]$, the target
values of the $m$ labeled nodes, and $\mathbf{u} \in \mathbb{R}^{m+u}$ a fixed vector. The constraint of the optimization thus
restricts the space of solutions to be in $H_1$, the hyperplane in $H$ of the vectors orthogonal to $\mathbf{u}$. We
denote by $\mathbf{P}$ the projection matrix over the hyperplane $H_1$. As further discussed later, for stability
reasons, $\mathbf{u}$ is typically selected to be orthogonal to the range of $\mathbf{L}$, range($\mathbf{L}$). More generally, the
optimization constraint can be generalized to orthogonality with respect to a subspace $U$ such that
the space of solutions $H_1$ is a subset of range($\mathbf{L}$).

In the case of the regularization algorithm of Belkin et al. (2004a), L is the graph Laplacian. 
Thus, $\mathbf{h}^\top\mathbf{L}\mathbf{h} = \sum_{i,j=1}^{m+u} w_{ij}(h(x_i) - h(x_j))^2$, for some weight matrix $(w_{ij})$. The vector $\mathbf{u}$ is defined
to be $\mathbf{1}$, that is, all its entries equal 1. For this algorithm, the authors further assume the label
vector $\mathbf{y}$ to be centered, which implies that $\mathbf{u}^\top\mathbf{y} = 0$, and also that the graph $G$ is connected. This
last assumption implies that the zero eigenvalue of the Laplacian has multiplicity one and that $H_1$
coincides with range($\mathbf{L}$).

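For a sample $S$ drawn without replacement from $X$, define $\mathbf{I}_S \in \mathbb{R}^{(m+u)\times(m+u)}$ as the diagonal
matrix with $[\mathbf{I}_S]_{i,i} = 1$ if $x_i \in S$ and 0 otherwise. Similarly, let $\mathbf{y}_S \in \mathbb{R}^{(m+u)\times 1}$ be the vector with
$[\mathbf{y}_S]_{i,1} = y_i$ if $x_i \in S$ and 0 otherwise. Then, the Lagrangian associated to the problem (55) is
$\mathcal{L} = \mathbf{h}^\top\mathbf{L}\mathbf{h} + \frac{C}{m}(\mathbf{h}_S - \mathbf{y}_S)^\top(\mathbf{h}_S - \mathbf{y}_S) + \beta\mathbf{h}^\top\mathbf{u}$, where $\beta \in \mathbb{R}$ is a Lagrange variable. Setting its
gradient with respect to $\mathbf{h}$ to zero gives, up to a constant factor absorbed into $\beta$ and with $\mathbf{h}_S$
identified with $\mathbf{I}_S\mathbf{h}$,

\[
\mathbf{L}\mathbf{h} + \frac{C}{m}(\mathbf{h}_S - \mathbf{y}_S) + \beta\mathbf{u} = 0. \tag{56}
\]

Multiplying by the projection matrix $\mathbf{P}$ gives, since $\mathbf{P}\mathbf{u} = 0$,

\[
\mathbf{P}\left(\mathbf{L} + \frac{C}{m}\mathbf{I}_S\right)\mathbf{h} = \frac{C}{m}\mathbf{P}\mathbf{y}_S - \beta\mathbf{P}\mathbf{u} = \frac{C}{m}\mathbf{P}\mathbf{y}_S. \tag{57}
\]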

6.2 Score stability of graph Laplacian regularization algorithm 

This section gives a simple generalization bound for the graph Laplacian regularization algorithm 
using a closed-form solution of (57) and a score-stability analysis. 

In the case of the graph Laplacian regularization algorithm of Belkin et al. (2004a) with the
assumptions already indicated, matrix $\mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_S\right)$ is invertible. Then, Equation (57) gives the
closed-form solution:

\[
\mathbf{h} = \left[\mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_S\right)\right]^{-1}\mathbf{P}\mathbf{y}_S, \tag{58}
\]

which clearly verifies the constraint of the optimization problem.
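As an illustration, the constrained problem can also be solved directly from its optimality conditions,
that is the stationarity equation $\mathbf{L}\mathbf{h} + \frac{C}{m}(\mathbf{I}_S\mathbf{h} - \mathbf{y}_S) + \beta\mathbf{1} = 0$ together with $\mathbf{1}^\top\mathbf{h} = 0$,
written as one linear system in $(\mathbf{h}, \beta)$. The sketch below is not the authors' implementation: the
path graph, the labels, the value of $C$ and all function names are hypothetical, and $\mathbf{L}$ is taken to be
the unnormalized Laplacian $\mathbf{D} - \mathbf{W}$ of a connected graph.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def graph_laplacian(W):
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def constrained_laplacian_regression(W, labeled, y, C):
    """Return (h, beta) solving the stationarity condition with the constraint 1^T h = 0."""
    n, m = len(W), len(labeled)
    L = graph_laplacian(W)
    A = [[L[i][j] for j in range(n)] + [1.0] for i in range(n)]  # columns: h, then beta
    A.append([1.0] * n + [0.0])                                  # constraint row
    b = [0.0] * (n + 1)
    for i, yi in zip(labeled, y):
        A[i][i] += C / m          # (C/m) I_S
        b[i] = C / m * yi         # (C/m) y_S
    sol = solve(A, b)
    return sol[:n], sol[n]

# Hypothetical example: path graph on 4 vertices, vertices 0 and 1 labeled, centered labels.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
h, beta = constrained_laplacian_regression(W, labeled=[0, 1], y=[1.0, -1.0], C=2.0)
print(h, sum(h))
```

The returned vector sums to zero, and projecting the stationarity equation onto the hyperplane
orthogonal to $\mathbf{1}$ removes the multiplier term, as in (57).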

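Theorem 16 Assume that the graph $G = (X, E)$ is connected and that its vertex labels are bounded:
for all $x$, $|y(x)| \le M$ for some $M > 0$. Let $\mathbf{h}$ denote the solution of the optimization problem (55)
where $\mathbf{L}$ is the graph Laplacian and $\mathbf{u} = \mathbf{1}$, and let $A = 1 + \kappa\sqrt{C}$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$,

\[
R(h) \le \hat{R}(h) + \beta + \left(2\beta + \frac{(AM)^2(m+u)}{mu}\right)\sqrt{\frac{\alpha(m,u)\ln\frac{1}{\delta}}{2}}, \tag{59}
\]

where

\[
\alpha(m, u) = \frac{mu}{m + u - 1/2} \cdot \frac{1}{1 - 1/(2\max\{m, u\})}
\quad\text{and}\quad
\beta \le \frac{4\sqrt{2}M^2}{m\lambda_2/C - 1} + \frac{4\sqrt{2m}M^2}{(m\lambda_2/C - 1)^2},
\]

$\lambda_2$ being the second smallest eigenvalue of the Laplacian $\mathbf{L}$.

Proof Our proof is similar to that of Theorem 5 in (Belkin et al., 2004a), with the important
exception that we no longer need to cope with vertex multiplicity in sampling since $S$ is sampled
from $X$ without replacement. This makes our proof and the resulting bound considerably simpler
and more concise.

By Lemma 9 and Equation 25, since the labels are bounded by $M$, for any $x$, the following
inequality holds: $|h(x) - y(x)| \le M(1 + \kappa\sqrt{C}) = AM$. To determine the stability coefficient, it
suffices to bound $\max_{S,S'}\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty$, where $S$ and $S'$ are two training sets that differ only in one
vertex. Let $\mathbf{M}_S = \mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_S\right)$ and $\mathbf{M}_{S'} = \mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_{S'}\right)$. Then,

\begin{align}
\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty &\le \|\mathbf{h}_S - \mathbf{h}_{S'}\| \tag{60}\\
&= \|\mathbf{M}_S^{-1}\mathbf{P}\mathbf{y}_S - \mathbf{M}_{S'}^{-1}\mathbf{P}\mathbf{y}_{S'}\| \tag{61}\\
&= \|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'}) + (\mathbf{M}_S^{-1} - \mathbf{M}_{S'}^{-1})\mathbf{P}\mathbf{y}_{S'}\| \tag{62}\\
&\le \|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'})\| + \|(\mathbf{M}_S^{-1} - \mathbf{M}_{S'}^{-1})\mathbf{P}\mathbf{y}_{S'}\|. \tag{63}
\end{align}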
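For any column matrix $\mathbf{v} \in \mathbb{R}^{(m+u)\times 1}$, by the triangle inequality and the projection property
$\|\mathbf{P}\mathbf{v}\| \le \|\mathbf{v}\|$, the following inequalities hold:

\begin{align}
\left\|\frac{m}{C}\mathbf{P}\mathbf{L}\mathbf{v}\right\| &= \left\|\frac{m}{C}\mathbf{P}\mathbf{L}\mathbf{v} + \mathbf{P}\mathbf{I}_S\mathbf{v} - \mathbf{P}\mathbf{I}_S\mathbf{v}\right\| \tag{64}\\
&\le \left\|\frac{m}{C}\mathbf{P}\mathbf{L}\mathbf{v} + \mathbf{P}\mathbf{I}_S\mathbf{v}\right\| + \|\mathbf{P}\mathbf{I}_S\mathbf{v}\| \tag{65}\\
&\le \left\|\mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_S\right)\mathbf{v}\right\| + \|\mathbf{I}_S\mathbf{v}\|. \tag{66}
\end{align}

This yields the lower bound:

\[
\|\mathbf{M}_S\mathbf{v}\| = \left\|\mathbf{P}\left(\frac{m}{C}\mathbf{L} + \mathbf{I}_S\right)\mathbf{v}\right\| \ge \frac{m}{C}\|\mathbf{P}\mathbf{L}\mathbf{v}\| - \|\mathbf{I}_S\mathbf{v}\|
\ge \left(\frac{m}{C}\lambda_2 - 1\right)\|\mathbf{v}\|, \tag{67}
\]

which gives the following upper bound on $\|\mathbf{M}_S^{-1}\|$ and $\|\mathbf{M}_{S'}^{-1}\|$:

\[
\|\mathbf{M}_S^{-1}\| \le \frac{1}{\frac{m}{C}\lambda_2 - 1} \quad\text{and}\quad \|\mathbf{M}_{S'}^{-1}\| \le \frac{1}{\frac{m}{C}\lambda_2 - 1}. \tag{68}
\]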

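We bound each of the two terms, $\|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'})\|$ and $\|(\mathbf{M}_S^{-1} - \mathbf{M}_{S'}^{-1})\mathbf{P}\mathbf{y}_{S'}\|$, separately.
$\|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'})\|$ can be bounded straightforwardly:

\[
\|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'})\| \le \|\mathbf{M}_S^{-1}\|\,\|\mathbf{P}(\mathbf{y}_S - \mathbf{y}_{S'})\| \le \|\mathbf{M}_S^{-1}\|\,\|\mathbf{y}_S - \mathbf{y}_{S'}\|
\le \frac{\sqrt{2}M}{\frac{m}{C}\lambda_2 - 1}. \tag{69}
\]

$\|(\mathbf{M}_S^{-1} - \mathbf{M}_{S'}^{-1})\mathbf{P}\mathbf{y}_{S'}\|$ is bounded as follows:

\begin{align}
\|(\mathbf{M}_S^{-1} - \mathbf{M}_{S'}^{-1})\mathbf{P}\mathbf{y}_{S'}\| &= \|\mathbf{M}_S^{-1}(\mathbf{M}_{S'} - \mathbf{M}_S)\mathbf{M}_{S'}^{-1}\mathbf{P}\mathbf{y}_{S'}\| \tag{70}\\
&= \|\mathbf{M}_S^{-1}\mathbf{P}(\mathbf{I}_{S'} - \mathbf{I}_S)\mathbf{M}_{S'}^{-1}\mathbf{P}\mathbf{y}_{S'}\| \tag{71}\\
&\le \frac{\sqrt{2m}M}{\left(\frac{m}{C}\lambda_2 - 1\right)^2}. \tag{72}
\end{align}

This leads to the following bound on $\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty$:

\[
\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty \le \frac{\sqrt{2}M}{\frac{m}{C}\lambda_2 - 1} + \frac{\sqrt{2m}M}{\left(\frac{m}{C}\lambda_2 - 1\right)^2}. \tag{73}
\]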



Note that this is the hypothesis stability of the algorithm. Let $h_S(x_i)$ denote the predicted target
value of the $i$th vertex under $\mathbf{h}_S$ (i.e. the $i$th coordinate of $\mathbf{h}_S$). The cost-stability is given by:

\[
|c(h_S, x_i) - c(h_{S'}, x_i)| = \left|(h_S(x_i) - y_i)^2 - (h_{S'}(x_i) - y_i)^2\right| \le 4M\,\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty. \tag{74}
\]

Substituting the upper bound on $\|\mathbf{h}_S - \mathbf{h}_{S'}\|_\infty$ derived in Equation 73 into the above expression
yields the statement of the theorem. ■

The generalization bound we just presented differs in several respects from that of Belkin et al.
(2004a). Our bound explicitly depends on both $m$ and $u$ while theirs shows only a dependency on
$m$. Also, our bound does not depend on the number of times a point is sampled in the training set
(parameter $t$), thanks to our analysis based on sampling without replacement.

Contrasting the stability coefficient of Belkin's algorithm with the stability coefficient of LTR
(Theorem 13), we note that it does not depend on $C'$ and $\beta_{loc}$. This is because unlabeled points do
not enter the objective function, and thus $C' = 0$ and $\tilde{y}(x) = 0$ for all $x$. However, the stability
does depend on the second smallest eigenvalue $\lambda_2$ and the bound diverges as $\lambda_2$ approaches $\frac{C}{m}$.
Actually, the bound in Theorem 16 will converge so long as $\lambda_2 = \Omega(1/m)$. As observed empirically
by Cortes and Mohri (2007), this algorithm does not perform as well in comparison with LTR.

6.3 Cost stability of graph Laplacian regularization algorithm

Here we give a cost-stability analysis of the graph Laplacian regularization algorithm of Belkin et al.
(2004a). To do so, we show that the algorithm can in fact be viewed as a special instance of the
family of LTR algorithms. Theorem 13 can then be applied in this instance with a bound on the cost
stability coefficient.

To show that the graph Laplacian algorithm is a specific LTR algorithm, we need to prove
that the regularization term $\mathbf{h}^\top\mathbf{L}\mathbf{h}$ corresponds to the square of a norm in some reproducing kernel
Hilbert space (RKHS). We show a more general result valid for all positive semi-definite symmetric
matrices $\mathbf{L}$. We denote by $\mathbf{L}^+$ the pseudo-inverse of a matrix $\mathbf{L}$.

Theorem 17 Let $H_1$ be a vector space such that $H_1 \subseteq$ range($\mathbf{L}$), then the regularization term
$\mathbf{h}^\top\mathbf{L}\mathbf{h}$ coincides with the square of the norm in the RKHS defined by the kernel matrix $\mathbf{L}^+$.

Proof We need to show that there exists a kernel $K$ such that $\mathbf{h}^\top\mathbf{L}\mathbf{h} = \|\mathbf{h}\|_K^2$ for all $\mathbf{h} \in H_1$,
where $\|\cdot\|_K$ is the norm in the RKHS associated to $K$. This condition can be rewritten as
$\mathbf{h}^\top\mathbf{L}\mathbf{h} = \langle\mathbf{h}, \mathbf{h}\rangle_K$, and more generally in terms of the inner product of $\mathbf{h}, \mathbf{h}' \in H_1$ as

\[
\mathbf{h}'^\top\mathbf{L}\mathbf{h} = \langle\mathbf{h}', \mathbf{h}\rangle_K. \tag{75}
\]

Let $\mathbf{K}$ denote the Gram matrix of $K$ for the sample $S$. Select $\mathbf{h}'$ to be $\mathbf{K}\mathbf{e}_i$, where $\mathbf{e}_i$ is the $i$th unit
vector of $H$. Then, the equality is equivalent to

\[
\forall i \in [1, m+u],\quad \mathbf{e}_i^\top\mathbf{K}\mathbf{L}\mathbf{h} = \langle\mathbf{K}\mathbf{e}_i, \mathbf{h}\rangle_K = \langle K(x_i, \cdot), \mathbf{h}\rangle_K = h(x_i) = \mathbf{e}_i^\top\mathbf{h}, \tag{76}
\]

where we used the reproducing property of the inner product. Since the equality $\mathbf{e}_i^\top\mathbf{K}\mathbf{L}\mathbf{h} = \mathbf{e}_i^\top\mathbf{h}$
holds for all $i \in [1, m+u]$, this is equivalent to the following,

\[
\forall\mathbf{h} \in H_1,\quad \mathbf{K}\mathbf{L}\mathbf{h} = \mathbf{h}. \tag{77}
\]

$\mathbf{K} = \mathbf{L}^+$ verifies this equality. Indeed, by the properties of the pseudo-inverse, $\mathbf{L}^+\mathbf{L}$ is the
projection over range($\mathbf{L}$). Since by assumption $H_1 \subseteq$ range($\mathbf{L}$), we can write $\mathbf{L}^+\mathbf{L}\mathbf{h} = \mathbf{h}$. ■
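The identity $\mathbf{K}\mathbf{L}\mathbf{h} = \mathbf{h}$ with $\mathbf{K} = \mathbf{L}^+$ can be checked on a small connected graph. The sketch
below is an illustration under stated assumptions: it uses the unnormalized Laplacian $\mathbf{D} - \mathbf{W}$ of a
hypothetical weighted cycle and computes the pseudo-inverse through the identity
$\mathbf{L}^+ = (\mathbf{L} + \mathbf{J}/n)^{-1} - \mathbf{J}/n$, valid for connected graphs, where $\mathbf{J}$ is the all-ones matrix. It
then compares $\mathbf{h}^\top\mathbf{L}\mathbf{h}$ with the RKHS norm $\mathbf{c}^\top\mathbf{K}\mathbf{c}$ of $\mathbf{h} = \mathbf{K}\mathbf{c}$, $\mathbf{c} = \mathbf{L}\mathbf{h}$, for a
vector $\mathbf{h}$ orthogonal to $\mathbf{1}$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def graph_laplacian(W):
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def laplacian_pseudo_inverse(L):
    """L^+ for a connected graph, via L^+ = (L + J/n)^{-1} - J/n."""
    n = len(L)
    shifted = [[L[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    # columns[j] is the j-th column of the inverse of the shifted matrix
    columns = [solve(shifted, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    return [[columns[j][i] - 1.0 / n for j in range(n)] for i in range(n)]

W = [[0, 1, 0, 2], [1, 0, 3, 0], [0, 3, 0, 1], [2, 0, 1, 0]]   # hypothetical weighted cycle
L = graph_laplacian(W)
K = laplacian_pseudo_inverse(L)
h = [1.0, -2.0, 0.5, 0.5]            # orthogonal to the all-ones vector

Lh = matvec(L, h)
KLh = matvec(K, Lh)                   # should reproduce h
quad_form = dot(h, Lh)                # h^T L h
rkhs_norm_sq = dot(Lh, matvec(K, Lh)) # c^T K c with c = L h
print(KLh, quad_form, rkhs_norm_sq)
```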

In the particular case of the graph Laplacian, when the graph is connected, the space $H_1$ orthogonal
to $\mathbf{1}$ coincides with range($\mathbf{L}$) and the result of the theorem holds.

Corollary 18 Any constrained optimization algorithm of the form (55) with $H_1 \subseteq$ range($\mathbf{L}$) is a
special instance of the LTR algorithms. In particular, the graph Laplacian regularization algorithm
of Belkin et al. (2004a) is a specific instance of the LTR algorithms.

The following theorem gives a bound on the cost stability of the graph Laplacian algorithm. 

Theorem 19 Assume that the hypothesis set $H$ is bounded; that is, for all $h \in H$ and $x \in X$,
$|h(x) - y(x)| \le M$. Then, the graph Laplacian regularization algorithm of Belkin et al. (2004a)
has uniform stability $\beta$ with

\[
\beta \le \frac{4CM^2}{m}\min\left\{\frac{1}{\lambda_2}, \rho_G\right\}, \tag{78}
\]

where $\lambda_2$ is the second smallest eigenvalue of the Laplacian matrix and $\rho_G$ the diameter of the graph
$G$.

Proof By Corollary 18, the graph Laplacian regularization algorithm of Belkin et al. (2004a) is
a special case of the LTR algorithms. Thus, Theorem 13 can be applied to determine its stability
coefficient, with the term $AM$ bounding $|h(x) - y(x)|$ in that theorem replaced by $M$ here:

\[
\beta \le \frac{4CM^2\kappa^2}{m}. \tag{79}
\]

Furthermore, using the same techniques as (Herbster et al., 2005), we can bound $\mathbf{h}^\top\mathbf{L}^+\mathbf{h}$ and thus
$\kappa^2$ in terms of the second smallest eigenvalue of the Laplacian matrix $\lambda_2$ and the diameter of the
graph $\rho_G$ as: $\kappa^2 \le \min\left\{\frac{1}{\lambda_2}, \rho_G\right\}$. Substituting this upper bound in Equation 79 yields the statement
of the theorem. ■

The following theorem gives a novel stability generalization bound for the algorithm of Belkin et al.
(2004a) in terms of the second eigenvalue of the Laplacian and the diameter of the graph.

Theorem 20 Let $H$ be a bounded hypothesis set. Let $G$ be a connected graph with diameter $\rho_G$
and $\mathbf{L}$ be the associated Laplacian kernel with second smallest eigenvalue $\lambda_2$. Let $S$ be a random
subset of labeled points of size $m$ drawn from the vertex set $X$. Let $h$ be the hypothesis returned by
Equation 55 when trained on $X = (S, T)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

\[
R(h) \le \hat{R}(h) + \frac{4CM^2\kappa^2}{m} + \left(\frac{8CM^2\kappa^2}{m} + \frac{M^2(m+u)}{mu}\right)\sqrt{\frac{\alpha(m,u)\ln\frac{1}{\delta}}{2}},
\]

where $\alpha(m, u) = \frac{mu}{m + u - 1/2}$ and $\kappa^2 = \min\{1/\lambda_2, \rho_G\}$.

Proof The result follows directly from Theorem 8 and the stability coefficient $\beta$ derived in
Theorem 19. ■
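To get a feel for how the stability coefficient $\beta \le \frac{4CM^2}{m}\min\{1/\lambda_2, \rho_G\}$ and the resulting slack
term behave, the quantities can be evaluated numerically. The sketch below is only illustrative: the
function names and the numerical values are hypothetical, and it uses $\alpha(m, u) = \frac{mu}{m+u-1/2}$ together
with the generic form $\beta + (2\beta + M^2(m+u)/(mu))\sqrt{\alpha(m,u)\ln(1/\delta)/2}$ of the slack.

```python
import math

def alpha(m, u):
    return m * u / (m + u - 0.5)

def laplacian_stability(C, M, m, lambda2, diameter):
    """Upper bound on the cost-stability coefficient beta of the graph Laplacian algorithm."""
    kappa_sq = min(1.0 / lambda2, diameter)
    return 4.0 * C * M ** 2 * kappa_sq / m

def generalization_slack(C, M, m, u, lambda2, diameter, delta):
    """Slack added to the empirical error in the stability-based bound."""
    beta = laplacian_stability(C, M, m, lambda2, diameter)
    bounded_term = M ** 2 * (m + u) / (m * u)
    return beta + (2.0 * beta + bounded_term) * math.sqrt(alpha(m, u) * math.log(1.0 / delta) / 2.0)

# Hypothetical setting: lambda_2 = 0.5, diameter 3, delta = 0.05, and C scaled as 1/m.
for m in (100, 1000, 10000):
    slack = generalization_slack(C=1.0 / m, M=1.0, m=m, u=m, lambda2=0.5, diameter=3, delta=0.05)
    print(m, slack)
```

With $C$ of order $1/m$ and $\lambda_2$ bounded away from zero, the slack decreases as $m = u$ grows.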
6.4 General case

The previous sections demonstrated the stability benefits of constraints of the type $\mathbf{u}^\top\mathbf{h} = 0$, which
helped us bound the stability of the graph Laplacian regularization algorithm and derive
stability-based generalization bounds.

This idea, and in fact many of the results presented for this particular algorithm, can be
generalized. To ensure stability, it suffices that the optimization constraint restricts the hypothesis set $H_1$
to be a subset of range($\mathbf{L}$). By Theorem 17, the regularization term then corresponds to an RKHS
norm regularization. Orthogonality with respect to a single vector may not be sufficient to ensure
$H_1 \subseteq$ range($\mathbf{L}$). In fact, this does not hold even for the graph Laplacian regularization algorithm if
the graph $G$ is not connected, since the dimension of the null space of $\mathbf{L}$ is then more than one. But
the constraints can be augmented to guarantee this property by imposing orthogonality with respect
to the null space. More generally, one might wish to impose orthogonality with respect to some
space that guarantees that the smallest non-zero eigenvalue over $H_1$ is not too small, for example
by excluding eigenvalue $\lambda_2$ if it is too small.

In particular, "stable" versions of the algorithms presented in Section 5 (CM, LL-Reg, and GMF)
can be derived by augmenting their optimization problems with such constraints. Recall that the
stability bound in Proposition 14 is inversely proportional to the smallest eigenvalue $\lambda_m(\mathbf{Q})$. The
main difficulty with using the proposition for these algorithms is that $\lambda_m(\mathbf{Q}) = 0$ in each case. Let
$\mathbf{v}_m$ denote the eigenvector corresponding to $\lambda_m(\mathbf{Q})$ and let $\lambda_2$ be the second smallest eigenvalue of
$\mathbf{Q}$. One can modify (43) and constrain the solution to be orthogonal to $\mathbf{v}_m$ by imposing
$\mathbf{h} \cdot \mathbf{v}_m = 0$. In the case of Belkin et al. (2004a), $\mathbf{v}_m = \mathbf{1}$. This modification, motivated by the algorithm
of Belkin et al. (2004a), is equivalent to increasing the smallest eigenvalue to be $\lambda_2$.

As an example, by imposing the additional constraint, we can show that the stability coefficient
of CM becomes bounded by $O(C/\lambda_2)$, instead of $\Theta(1)$. Thus, if $C = O(1/m)$ and $\lambda_2 = \Omega(1)$, it is
bounded by $O(1/m)$ and the generalization bound converges as $O(1/m)$.

7. Experiments 

This section reports the results of experiments using our stability-based generalization bound for 
model selection for the LTR algorithm. A crucial parameter of this algorithm is the stability
coefficient $\beta_{loc}(r)$ of the local algorithm, which computes pseudo-targets $\tilde{y}_x$ based on a ball of radius $r$
around each point. We derive an expression for $\beta_{loc}(r)$ and show, using extensive experiments with
multiple data sets, that the value $r^*$ minimizing the bound is a remarkably good estimate of the best
$r$ for the test error. This demonstrates the benefit of our generalization bound for model selection,
avoiding the need for a held-out validation set.

The experiments were carried out on several publicly available regression data sets: Boston
Housing, Elevators and Ailerons.¹ For each of these data sets, we used $m = u$, inspired by the
observation that, all other parameters being fixed, the bound of Theorem 8 is tightest when $m = u$.
The values of the input variables were normalized to have mean zero and variance one. For the
Boston Housing data set, the total number of examples was 506. For the Elevators and the Ailerons
data sets, a random subset of 2000 examples was used. For both of these data sets, other random
subsets of 2000 samples led to similar results. The Boston Housing experiments were repeated
for 50 random partitions, while for the Elevators and the Ailerons data sets, the experiments were
repeated for 20 random partitions each. Since the target values for the Elevators and the Ailerons
data sets were extremely small, they were scaled by a factor 1000 and 100 respectively in a
pre-processing step.

1. www.liaad.up.pt/~ltorgo/Regression/DataSets.html

Figure 2: MSE against the radius $r$ of LTR for three data sets: (a) Boston Housing, (b) Ailerons,
(c) Elevators. The small horizontal bar indicates the location (mean ± one standard
deviation) of the minimum of the empirically determined $r$.

In our experiments, we estimated the pseudo-target of a point $x' \in T$ as a weighted average of
the labeled points $x \in N(x')$ in a neighborhood of $x'$. Thus,
$\tilde{y}_{x'} = \sum_{x \in N(x')} \alpha_x y(x) \big/ \sum_{x \in N(x')} \alpha_x$.
We considered two weighting approaches, as discussed in (Cortes and Mohri, 2007), defining them
in terms of the inverse of the distance between $\Phi(x)$ and $\Phi(x')$ (i.e. $\alpha_x = (1 + \|\Phi(x) - \Phi(x')\|)^{-1}$),
and in terms of a similarity measure $K(x, x')$ captured by a kernel $K$ (i.e. $\alpha_x = K(x, x')$). In
our experiments, the two approaches produced similar results. We report the results of kernelized
weighted average with a Gaussian kernel.
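The weighted-average estimate can be sketched in a few lines. The code below is a minimal illustration,
not the implementation used in the experiments: the function names are hypothetical, the feature map
$\Phi$ is taken to be the identity, and the Gaussian kernel is assumed to be parametrized as
$\exp(-\|x - x'\|^2/(2\sigma^2))$.

```python
import math

def gaussian_weight(x, x_prime, sigma=1.0):
    # Assumed parametrization: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    return math.exp(-math.dist(x, x_prime) ** 2 / (2.0 * sigma ** 2))

def inverse_distance_weight(x, x_prime):
    # alpha_x = (1 + ||Phi(x) - Phi(x')||)^{-1}, with Phi taken to be the identity here.
    return 1.0 / (1.0 + math.dist(x, x_prime))

def pseudo_target(x_prime, labeled, r, weight):
    """Weighted average of the labels of the labeled points within distance r of x_prime.

    labeled is a list of (x, y) pairs; returns None when the neighborhood is empty.
    """
    num = den = 0.0
    for x, y in labeled:
        if math.dist(x, x_prime) <= r:
            a = weight(x, x_prime)
            num += a * y
            den += a
    return num / den if den > 0 else None

# Hypothetical labeled sample and unlabeled point.
labeled = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((5.0, 5.0), 100.0)]
x_prime = (0.5, 0.0)
print(pseudo_target(x_prime, labeled, r=1.0, weight=gaussian_weight))
print(pseudo_target(x_prime, labeled, r=1.0, weight=inverse_distance_weight))
print(pseudo_target(x_prime, labeled, r=0.1, weight=gaussian_weight))
```

In this example the two neighbors are equidistant from the unlabeled point, so both weighting schemes
return the plain average of their labels, and the far-away point is ignored.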



Lemma 21 Let $r \ge 0$ be the radius of the ball around an unlabeled point $x' \in X$ that determines
the neighborhood $N(x')$ of $x'$ and let $m(r)$ be the number of labeled points in $N(x')$. Furthermore,
assume that the values of the labels are bounded (i.e. for all $x \in X$, $|y(x)| \le M$ for some $M > 0$)
and that all the weights in (7) are non-negative (i.e. for all $x$, $\alpha_x \ge 0$). Then, the stability coefficient
of the weighted average algorithm for determining the estimate of the unlabeled point $x'$ is bounded
by:

\[
\beta_{loc} \le \frac{4\alpha_{\max}M}{\alpha_{\min}m(r)}, \tag{81}
\]

where $\alpha_{\max} = \max_{x \in N(x')}\alpha_x$ and $\alpha_{\min} = \min_{x \in N(x')}\alpha_x$.

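Proof We consider the change in the estimate as a point is removed from $N(x')$ and show that this is
at most $\frac{2\alpha_{\max}M}{\alpha_{\min}m(r)}$. The statement of the lemma then follows straightforwardly from the observation
that changing one point is equivalent to removing one point and adding another point.

Let $N(x') = \{x_1, \ldots, x_{m(r)}\}$. For ease of notation, assume that $n = m(r)$. Consider the effect
of removing $x_n$ from the neighborhood $N(x')$. The estimate changes by:

\[
\frac{\sum_{i=1}^{n}\alpha_i y_i}{\sum_{i=1}^{n}\alpha_i} - \frac{\sum_{i=1}^{n-1}\alpha_i y_i}{\sum_{i=1}^{n-1}\alpha_i}.
\]

Thus, the stability $\beta_{loc}$ can be bounded as follows:

\begin{align*}
\beta_{loc} &\le \left|\frac{\sum_{i=1}^{n}\alpha_i y_i}{\sum_{i=1}^{n}\alpha_i} - \frac{\sum_{i=1}^{n-1}\alpha_i y_i}{\sum_{i=1}^{n-1}\alpha_i}\right|\\
&\le \frac{\alpha_n|y_n|}{\sum_{i=1}^{n}\alpha_i} + \sum_{i=1}^{n-1}\alpha_i|y_i|\,\left|\frac{1}{\sum_{i=1}^{n}\alpha_i} - \frac{1}{\sum_{i=1}^{n-1}\alpha_i}\right|\\
&\le \frac{2\alpha_n M}{\sum_{i=1}^{n}\alpha_i} \le \frac{2\alpha_{\max}M}{\sum_{i=1}^{n}\alpha_i} \le \frac{2\alpha_{\max}M}{\alpha_{\min}m(r)},
\end{align*}

which proves the statement of the lemma. ■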
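The removal step of the argument, namely that dropping one point from a neighborhood of $n$ points changes
the weighted average by at most $\frac{2\alpha_{\max}M}{\alpha_{\min}n}$, can be tested empirically. The following sketch
draws random positive weights and bounded labels; the sampling ranges, the seed and the function names are
arbitrary choices made for the illustration.

```python
import random

def weighted_average(weights, labels):
    return sum(a * y for a, y in zip(weights, labels)) / sum(weights)

def removal_bound(weights, M):
    """2 * alpha_max * M / (alpha_min * n), with n the number of points before removal."""
    return 2.0 * max(weights) * M / (min(weights) * len(weights))

random.seed(0)
M = 1.0
worst_ratio = 0.0
for _ in range(1000):
    n = random.randint(2, 20)
    weights = [random.uniform(0.1, 1.0) for _ in range(n)]   # strictly positive weights
    labels = [random.uniform(-M, M) for _ in range(n)]       # |y| <= M
    # Change in the estimate when the last point is removed from the neighborhood.
    change = abs(weighted_average(weights, labels)
                 - weighted_average(weights[:-1], labels[:-1]))
    worst_ratio = max(worst_ratio, change / removal_bound(weights, M))

print("largest observed change / bound:", worst_ratio)
```

The observed ratio stays below 1, and replacing a point (one removal followed by one addition) at most
doubles the change.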



Corollary 22 Using the notation of Lemma 21, the stability coefficient of the kernelized weighted
average algorithm with a Gaussian kernel $K$ with parameter $\sigma$ is bounded by:

\[
\beta_{loc} \le \frac{4M}{m(r)\,e^{-2r^2/\sigma^2}}.
\]

Proof This follows directly from Lemma 21 using the observation that for a Gaussian kernel $K$,
$K(x, x') \le 1$, and for $x, x'$ such that $\|x\| \le r$ and $\|x'\| \le r$, $\|x - x'\| \le 2r$. Thus,
$K(x, x') \ge e^{-2r^2/\sigma^2}$. ■

Corollary 23 Using the notation of Lemma 21, the stability coefficient of the weighted average
algorithm, where weights are determined by the inverse of the distance in the feature space, i.e.
$\alpha_x = (1 + \|\Phi(x) - \Phi(x')\|)^{-1}$, is bounded by:²

\[
\beta_{loc} \le \frac{4(2r + 1)M}{m(r)}.
\]

Proof This follows directly from Lemma 21 using the observation that for all $x \in N(x')$,
$0 \le \|\Phi(x) - \Phi(x')\| \le 2r$. ■



To estimate $\beta_{loc}$, one needs an estimate of $m(r)$, the number of examples in a ball of radius $r$ from
an unlabeled point $x'$. In our experiments, we estimated $m(r)$ as the number of labeled examples in
a ball of radius $r$ from the origin. Since all features are normalized to mean zero and variance one,
the origin is also the centroid of the set $X$.
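This estimate of $m(r)$ and the resulting value of the bound on $\beta_{loc}$ can be sketched as follows. The
code assumes the general form $\beta_{loc} \le \frac{4\alpha_{\max}M}{\alpha_{\min}m(r)}$ with $\alpha_{\max} \le 1$, and
lower-bounds $\alpha_{\min}$ by the kernel value at distance $2r$. The Gaussian parametrization
$\exp(-d^2/(2\sigma^2))$, the sample points and the function names are assumptions made for the illustration.

```python
import math

def count_in_ball(points, r):
    """Number of points within Euclidean distance r of the origin (estimate of m(r))."""
    return sum(1 for x in points if math.sqrt(sum(c * c for c in x)) <= r)

def beta_loc_bound(m_r, M, alpha_min, alpha_max=1.0):
    """Bound 4 * alpha_max * M / (alpha_min * m(r)) on the local stability coefficient."""
    return 4.0 * alpha_max * M / (alpha_min * m_r)

def gaussian_kernel_at(distance, sigma):
    # Assumed parametrization: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))

# Hypothetical normalized labeled points.
labeled = [(0.1, 0.0), (0.5, -0.5), (1.0, 1.0), (2.0, 0.0), (-0.3, 0.2)]
M, sigma = 1.0, 2.0
for r in (0.5, 1.0, 1.5, 2.5):
    m_r = count_in_ball(labeled, r)
    if m_r > 0:
        # Two points of the ball are at distance at most 2r from each other.
        bound = beta_loc_bound(m_r, M, alpha_min=gaussian_kernel_at(2 * r, sigma))
        print(r, m_r, bound)
```

The bound reflects a trade-off: a larger radius increases $m(r)$ but decreases the smallest weight.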

We implemented a dual solution of LTR and used Gaussian kernels, for which the parameter
$\sigma$ was selected using cross-validation on the training set. Experiments were repeated across 36
different pairs of values of $(C, C')$. For each pair, we varied the radius $r$ of the neighborhood used
to determine estimates from zero to the radius of the ball containing all points.

2. 1 is added to the weight to make the weights between 0 and 1.

Figure 2(a) shows the mean values of the test MSE of our experiments on the Boston Housing
data set for typical values of $C$ and $C'$. Figures 2(b)-(c) show similar results for the Ailerons and
Elevators data sets. For the sake of comparison, we also report results for induction. The induction
algorithm we chose was Kernel Ridge Regression (since it is analogous to LTR with the choice of
$C' = 0$). The relative standard deviations on the MSE are not indicated, but were typically of the
order of 10%. LTR generally achieves a significant improvement over induction.

The generalization bound we derived in Equation 21 consists of the training error and a
complexity term that depends on the parameters of the LTR algorithm $(C, C', M, m, u, \kappa, \beta_{loc}, \delta)$. Only
two terms depend upon the choice of the radius $r$: $\hat{R}(h)$ and $\beta_{loc}$. Thus, keeping all other
parameters fixed, the theoretically optimal radius $r^*$ is the one that minimizes the training error plus the
slack term. The figures also include plots of the training error combined with the complexity term,
appropriately scaled. The empirical minimization of the radius $r$ coincides with or is close to $r^*$.
The optimal $r$ based on test MSE is indicated with error bars.
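The selection rule itself is simple to state in code: over a grid of candidate radii, pick the one
minimizing the sum of the training error and the slack term. The sketch below only illustrates this rule.
The two curves passed to it are hypothetical stand-ins, not measurements from the experiments, and the
function name `select_radius` is introduced here for the example.

```python
def select_radius(radii, training_error, slack):
    """Return the radius minimizing training_error(r) + slack(r) over the candidate grid."""
    return min(radii, key=lambda r: training_error(r) + slack(r))

# Hypothetical curves: the training error grows slowly with r, while the slack term
# (driven by the local stability coefficient) shrinks as the neighborhood gets larger.
def toy_training_error(r):
    return 0.1 + 0.05 * r

def toy_slack(r):
    return 1.0 / (1.0 + r)

radii = [0.5 * k for k in range(1, 21)]          # candidate radii 0.5, 1.0, ..., 10.0
r_star = select_radius(radii, toy_training_error, toy_slack)
print("selected radius:", r_star)
```

In practice the two callables would be replaced by the measured training error of LTR at radius $r$ and
by the slack computed from the bound on $\beta_{loc}(r)$.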

8. Conclusion 

We presented a comprehensive analysis of the stability of transductive regression algorithms with 
novel generalization bounds for a number of algorithms. Since they are algorithm-dependent, our 
bounds are often tighter than those based on complexity measures such as the VC-dimension. Our 
experiments also show the effectiveness of our bounds for model selection and the good perfor- 
mance of LTR algorithms in practice. Our analysis can also guide the design of algorithms with bet- 
ter stability properties and thus generalization guarantees, as discussed in Section 6.4. The general 
concentration bound for uniform sampling without replacement proved here can be of independent 
interest in a variety of other machine learning and algorithmic analyses. 